1.   INTRODUCTION

1.1  Introduction  to  LSI  Design  Constraints 

The advent of large scale integration (LSI) technology for the
manufacture of logic circuits has posed a new challenge to computer
system and logic designers. The challenge is to find ways of
making efficient use of the full potential of LSI (reliability,
lower cost and improved speed) in the design of digital computers.

The  LSI  technology  has  peculiar  constraints  which  have  important 
implications  for  its  effective  use  in  future  systems.   The  constraints 
and  implications  can  be  broadly  classified  into  two  categories: 
external  or  system  level  considerations  and  internal  or  logic  circuit 
level constraints. With LSI, hundreds of logic functions can be
fabricated on subminiature substrates. Since the initial development
cost is very high, it is important that only a small number of standard
elements be developed, so that this cost is amortized over many units.
However, designing universal elements of the complexity
offered by LSI is very difficult. A potential benefit of LSI that
has been continually cited is an increase in reliability over current
systems. Since system reliability is inversely proportional to the
number of module interconnections, it is important that LSI devices
have a high gate-to-pin ratio. But universality of LSI devices and
a high gate-to-pin ratio conflict, in that the latter tends to give
the device a unique personality, so that it cannot be
used repetitively in a system.


At the internal level, designing logic for integration on the
chip requires a reorientation of the relative values placed on the
resources used to realize the design. One of the severest constraints
in the design of an LSI device is the restriction on interconnections
on the chip itself. This is due to the limited wiring area available,
the limited number of planes to which all of the wiring must be
confined, and a host of other topological considerations which combine
to determine the locations of candidate points for interconnections.
The required wiring can be reduced by forcing the logic design of the
chip into a cellular or regular structure. A regular structure has
several important implications:

a)  It facilitates every step of the LSI manufacturing process by
making it possible to perform relatively simple tasks
repetitively. Mask making, for example, is facilitated by the
repetitive structure.

b)  It is possible to design and optimize a simple cell to
achieve the most function per dollar, but a large chip of random
gates is impossible to optimize because of the number of variables
involved.

c)  Testing of LSI devices is a major cost factor. The generation
of test algorithms is easier for simple cells and for regular,
repetitive structures.

d)  In  addition,  as  the  yield  increases  due  to  technology 
improvements,  larger  devices  can  be  made  out  of  the  same 
simple  cells. 


Another limitation of LSI which must be considered in logic
design is that of external connections. More pins require more
external gates to drive the capacitance of the external pins. The
increased number of gates, and the higher current these external
gates require, raise the temperature of the device.
This increases the failure rate of the device.

Semiconductor  memories  meet  most  of  the  requirements  of  LSI 
technology  and  the  present  use  of  LSI  in  computer  systems  is  in  the 
form of these memories for system enhancement applications [1], [2], [3].
These applications include the use of RAMs as scratch-pad memories and
of cache memories to reduce main memory requests and, thus, increase
the computational throughput. The use of ROMs for microprogramming, table
look-up operations and hardwired subroutines also increases performance
at relatively little cost. Content addressable memories, queues and
stacks  can  greatly  simplify  building  and  maintaining  tables  and 
greatly  reduce  system  overhead  and  software  costs. 

The use of LSI in the design of the central processor itself involves
proper logic partitioning, that is, organizing the
internal logic structure such that large functional arrays on a chip
can be used repetitively. Two partitioning methods are bit-slicing
and functional partitioning. Bit-slicing [4], [5] tends to be system
dependent rather than universal and, thus, is suitable for custom LSI only.
In functional partitioning, the machine is structured into modules,
each of which is a completely self-contained processor
having local storage, some processing logic and the control necessary
for the module to execute its function. Each module acts as a small
insular unit of logic. The module control sees only its own state, and
the requirements for communication outside the module are correspondingly
reduced. Excellent examples of functional partitioning are RCA's
LIMAC and the Macromodular computers [6], [7].

In this thesis, we report on a study of the logical organization and
design of an Arithmetic Unit which is capable of performing the four basic
operations of addition, subtraction, multiplication and division.

The organization and design of the Arithmetic Unit are influenced
by the LSI technology constraints of modularity, a minimum number of
different module types, structural regularity of the module, limited
pin count and limited fan-out capability.

In the rest of this chapter, we first briefly review the
various proposals for the Arithmetic Unit and its LSI implementation.
This is followed by a brief introduction to the model chosen
for study in this research, and by the scope and an overview of the thesis.

1.2  Arithmetic  Unit  Structure  and  LSI 

The proposals for the architecture of LSI implementable
Arithmetic Units can be broadly classified into three categories, namely:
a) partitioning of the conventional ALU which uses standard binary number
representation, b) two dimensional iterative (cellular) structures and
table look-up methods, and c) use of number representations different
from conventional binary. It must be noted that these three categories
are not mutually exclusive but are rather interrelated. This
classification is used here simply for ease of exposition.


1.2.1 Partitioning of conventional ALU - A basic, low performance
parallel ALU essentially consists of registers (the accumulator,
the M-Q register, and the registers for parallel shifting), a full adder,
circuitry for complementing and shifting, and some control logic to
coordinate their activity for arithmetic and logic control. It is necessary
for efficient operation to allow flexible and rapid transfer of
information from any one register to another. In binary arithmetic, except for
the end bits, it is possible to partition all the circuitry associated
with one bit, along with some local decoder bits for gating functions,
into one LSI cell [8]. Thus, all the data transfer and manipulation
circuitry can be assembled into identical cells which provide a
fairly good gate/pin ratio. A classic example of this approach is the
Texas Instruments LSI airborne computer, the model 2502. However, this
bit-slicing approach breaks down for a high performance ALU using the
conventional binary number system, when circuitry for fast carry
generation and propagation is added to the ALU. Raytheon [9] has combined
four bit slices into one LSI module so that look-ahead logic could be
used on the four bits of this module. But this still does not provide the
flexibility for unlimited carry look-ahead with only one type of module.

For achieving high performance with conventional number systems, the
various slices of the ALU work in synchro-parallelism [10] and are controlled
by signals broadcast from the central control logic. Since the control
functions are more difficult to modularize than functions related to
data operations, a micromemory control technique is used for mapping
the irregular and diverse algorithms for arithmetic control into the
regular structure of a memory. However, for large word lengths, the
broadcasting of control signals is not compatible with the LSI constraints
of low fan-out and neighborhood connections only.

To  overcome  this  problem  of  control  irregularity  and  broadcasting, 
many  combinational  two-dimensional  iterative  structures  (cellular  arrays) 
have  been  proposed  for  multiplication,  division  and  other  arithmetic 
functions  like  square  root,  etc. 

1.2.2 Two dimensional iterative structures - Two dimensional iterative
structures are memory-like structures and admirably satisfy the LSI
constraints. From the arithmetic unit point of view, they can be further
classified  into  two  sub-categories  of  cellular  arrays  and  table  look-up 
methods. 

1.2.2.1  Cellular  arrays  -  A  cellular  array  is  a  two-dimensional 
iterative  configuration  of  identical  cells,  each  of  which  contains  both 
logic  and  storage  and  is  connected  mainly  to  its  immediate  neighbors. 
Such  an  array,  therefore,  has  the  form  of  a  memory  array  that  is  enhanced 
with  logic  at  each  digit  position.   A  cellular  array  is  a  spatial  analog 
of  the  temporal  sequence  of  steps  of  the  control  algorithm;  i.e.,  the 
cellular  array  performs  the  same  sequence  of  computations  iteratively  in 
space rather than in time. Cellular arrays can be either dedicated
exclusively [11] to one arithmetic function or programmable
[12] so that they can be used for many functions. Since multiplication
processes are all characterized by the basic algorithm of add/no add
followed by shift, the proposed multiplier arrays differ mainly in the
interconnection of the cells in the array for speeding up the effective
addition time of the partial products. Some use a tree adder structure [13]
while others use carry save adders [14] in the basic cells to avoid
the carry propagation problem at every stage, with carry propagation
occurring only at the last stage. Most arrays assume that the operands
are  either  positive  or  in  the  sign  magnitude  representation  with  the 
sign  of  the  product  being  determined  separately.   A  negative  multiplier 
in  2's  complement  representation  needs  a  correction  to  the  product 
obtained  by  the  simple  "add/no  add  and  shift"  algorithm  and  makes  the 
interconnection of the cells in the array somewhat irregular. A cellular
array for multiplication has been suggested [15] which makes use
of multiplier recoding and a conditional adder/subtracter cell so that
either addition or subtraction of the shifted multiplicand to or from the
partial product can take place. This does not require a final correction
to the product. However, the recoding array is structurally different
from the multiplication array, and thus two types of functional
arrays are needed to generate the final product. More recently, Baugh and
Wooley [16] have proposed a cellular multiplier in which the correction is
not necessary. Similarly, the cellular array for the division operation
uses  the  basic  binary  restoring  or  non-restoring  algorithm  to  produce 
the  quotient.  The  interconnection  structure  for  the  end  cells  used  for 
comparing  the  signs  of  the  divisor  and  the  partial  remainder  is  again 
different from the rest of the cell interconnection [17]. For fixed
point operations, a special step may be necessary at the end to generate
a remainder of the same sign as the dividend. In addition to the
dedicated arrays for each arithmetic operation, a programmable array
suitable for both multiplication and division has also been recently
proposed. Here the most significant cell of each row is more complicated
and acts either as a multiplier recoder cell or as a comparator cell
for comparing the signs of the divisor and partial remainders, depending
upon the arithmetic operation [18]. For operands of large word length,
the cellular array contains more cells than can reliably be implemented
on a single silicon slice, and hence is subdivided into subarrays which
are externally connected. Such an array made up of subarrays will take
more time to generate the final result than a fully iterative
monolithic array on a single silicon slice. Cellular arrays can be
either synchronous or asynchronous in operation. An asynchronous
cellular multiplier for vector or pipelined mode operations has been
proposed by Bjorner [19].
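
The "add/no add and shift" algorithm and the correction needed for a negative 2's complement multiplier can be sketched in a few lines of Python. This is a software illustration only, not a model of any of the cited arrays; the function name `shift_add_multiply` and the word length parameter `n` are assumptions made for the example.

```python
def shift_add_multiply(x, y, n):
    """Multiply x by an n-bit 2's complement multiplier y.

    Requires -2**(n-1) <= y < 2**(n-1).  Returns (product, uncorrected),
    where `uncorrected` is the result of the plain add/no-add-and-shift
    steps before any correction is applied.
    """
    y_bits = y & ((1 << n) - 1)     # bit pattern of the multiplier
    product = 0
    for i in range(n):
        if (y_bits >> i) & 1:       # "add" for a 1 bit, "no add" for a 0 bit
            product += x << i       # multiplicand shifted to position i
    uncorrected = product
    if y < 0:
        # The sign bit was given weight +2**(n-1) instead of -2**(n-1),
        # so the plain algorithm overshoots by x * 2**n.
        product -= x << n
    return product, uncorrected
```

For example, with `n = 4`, `x = 3` and `y = -2`, the multiplier bit pattern is 1110, the uncorrected steps accumulate 3 * 14 = 42, and subtracting 3 * 16 yields the expected -6.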

For  arrays  using  conventional  number  systems,  the  problem  of  carry 
propagation  along  a  row  of  cells  still  plagues  the  cellular  arrays. 
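
The carry save technique mentioned earlier, which defers carry propagation to a single final stage, can be illustrated as follows. This is a minimal Python sketch for unsigned operands; the names `carry_save_add` and `multiply_carry_save` are hypothetical and the code does not model the cell interconnections of any cited array.

```python
def carry_save_add(a, b, c):
    """Reduce three words to two: a + b + c == sum_word + carry_word.

    Each bit position is handled independently, so no carry ripples
    along the word in this step.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def multiply_carry_save(x, y, n):
    """Multiply unsigned x by an unsigned n-bit y using carry-save rows."""
    s, c = 0, 0
    for i in range(n):
        partial = (x << i) if (y >> i) & 1 else 0
        s, c = carry_save_add(s, c, partial)
    # The only carry-propagating addition occurs at the last stage.
    return s + c
```

Every row keeps its result as a separate sum word and carry word; only the final `s + c` needs a conventional adder with carry propagation.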

Thompson  [20]  and  Chen  [10]  have  suggested  using  the  cellular 
array  in  a  diagonally-timed  fashion  such  that  digit  level  pipeline 
takes  place  in  two  dimensions,  giving  a  higher  computational  throughput. 

Although cellular arrays do satisfy the structural needs
of logic circuits for LSI technology, a few shortcomings can be
summarized as follows:

i)   Due to the use of a conventional ripple carry adder/subtracter
in the basic cells, the total number of cells needed is
always equal to that necessary for a double length product,
although in practice only the single length product is most
often desired. This is due to the fact that the carry
ripples from the least significant end to the most significant
bit position; if the cells that contribute to the
less significant part of the double length product (for
fractional mantissas) do not exist, then the most significant
part of the product will be in error, and this error
becomes very acute for operands of large word length.
These otherwise unnecessary cells raise the cost of the
array.
ii)  In the case of cellular arrays, as many rows of adder/
subtracter basic cells are needed as there are multiplier
bits or quotient bits. But effectively, the rows corresponding
to "zero" multiplier or quotient bits serve no
useful purpose (except possibly shifting). These unnecessary
cells not only add to the cost by using large amounts
of silicon area, but also increase the probability of
faults on the chip and make the testing of the chip more
expensive.
iii) For large word lengths, where subarrays have to be externally
connected, the addition of other subarrays for expansion of
word length cannot be done without extensive changes in the
external wiring. Thus the expandability of two-dimensional
structures is poor.

1.2.2.2 Table look-up methods - The structural regularity of memories
makes them very suitable for implementation in large-scale integration.
All the logic and arithmetic operations in the machine can be performed
by extensive table look-up operations. Table look-up operations can
be done either in parallel or in a serial fashion. Parallel operations,
however, require too large a table for any reasonable word length and are
out of the question. However, tables for bit parallel and byte serial
operations can be reasonably implemented for arithmetic operations like
addition and multiplication because the number of words required is
related to $2^n$, where n is the operand width in bits. A functional memory
based on an associative array composed of writeable storage cells capable
of holding three states (0, 1, and don't care) has been proposed by
Gardner [21]. Here the logic is performed by associative table look-up,
and the "don't care" state is used to give significant compression of the
tables over conventional two-state arrays. Typically, only n to $n^2$ words
are necessary for functional memory instead of $2^n$ words for conventional
two-state arrays. In fact, such a functional memory has been suggested
as a nucleus and the building block for the whole machine.
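
The trade-off between a fully parallel table and a byte serial table can be made concrete with a small Python sketch. It assumes a slice width of four bits (`SLICE_BITS`) and uses hypothetical helper names; it illustrates the general idea of table look-up addition and is not Gardner's functional memory.

```python
SLICE_BITS = 4  # assumed slice width for this illustration


def build_add_table(bits=SLICE_BITS):
    """Build a table mapping (a, b, carry_in) to (sum, carry_out) for one slice."""
    size = 1 << bits
    table = {}
    for a in range(size):
        for b in range(size):
            for carry_in in (0, 1):
                total = a + b + carry_in
                table[(a, b, carry_in)] = (total % size, total // size)
    return table


def table_add(x, y, slices, table, bits=SLICE_BITS):
    """Add two unsigned words slice by slice, using only table look-ups."""
    mask = (1 << bits) - 1
    carry, result = 0, 0
    for k in range(slices):
        shift = k * bits
        s, carry = table[((x >> shift) & mask, (y >> shift) & mask, carry)]
        result |= s << shift
    return result, carry
```

The four-bit slice table has 2^9 = 512 entries and serves any word length in `slices` serial steps, whereas a single table indexed by two complete 16-bit operands and a carry would need 2^33 entries.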

Lee, et al. [22] and Crane, et al. [23] have proposed a distributed
logic memory structure which is suitable for LSI implementation.
Although they suggested this structure for nonarithmetical logic
operations, arithmetic can be performed by the bit serial table look-up
method. This method is too slow when operating on scalar operands.
For vector operands, however, the arithmetic operation proceeds
simultaneously on all components of the vector operands, and thus
the inherent slowness of the bit serial table look-up method is masked by
this parallelism. Bit serial processing is used in Goodyear's STARAN
computer [24].

1.2.3 New number system representation - One of the main obstacles
to the partition of currently existing arithmetic processors which use
conventional binary representation (radix complement or sign magnitude) into
identical subunits is the fact that the most significant digit behaves
differently from the rest of the digit positions. Radix complement and
diminished radix complement notation cause the control and structure of
the most significant and least significant digits to be different from
those of the rest of the digits, due to such things as carry-in to the
least significant digit, end-around carries and special circuitry for
logical and arithmetical shifts. These factors preclude the chaining of
accumulator modules to any desired length. Moreover, the radix or
diminished radix complement notation causes problems both in the
multiplication process (e.g., a correction factor) and in the division
process. Sign magnitude notation is convenient for multiplication and
division because the sign of the result can be readily determined, but
addition and subtraction need a complicated sequential control algorithm
to determine the sign of the result. All this difficulty can be traced
directly back to the limitations imposed by the requirement for knowledge
of the sign and magnitude of the operands and the result. This knowledge,
by definition, is available on a word level rather than a digit level.
It would therefore be preferable to have a number representation in which
each digit position carries both magnitude and polarity information,
unlike normal binary where the most significant bit carries sign but no
magnitude information and the other bits carry only magnitude but no sign
information. This would remove the above limitation, since a priori or
a posteriori knowledge of the operand or result magnitudes and polarities
is not necessary. It would make it possible to perform arithmetic on
a digit ("stage" in a machine) basis rather than on a number ("register"
in a machine) basis. That is, an arithmetic operation on corresponding
digits of a pair (or more) of numbers would become invariant with respect
to the polarities of the two (or more) numbers in which they are
separately imbedded. This has two independent but important implications:
one, that a true variable word length operation is completely practicable,
permitting modular construction in terms of the number of digit positions;
and two, that simultaneous operations on multiple (two or more) operands
are also practicable, permitting modular construction in terms of the
number of operands.

The sign information with each digit position in a number can be
provided either implicitly or explicitly. In a positional weighted
number system, a negative radix indirectly implies [25] a sign associated
with each digit position (positive for odd positions and negative for
even positions, for integers). An example of explicit sign information
with each digit is Avizienis' signed-digit number representation
[26]. These two approaches can be utilized to design computational
modules for each digit position, which can then be used to perform
arithmetic either in a purely combinational logic net (array) or with
a sequential control algorithm. Shaipov [27] and Prangishvilli have
proposed cellular arithmetic arrays using a basic computation module
based on a minus-two adder system. Avizienis and Tung [28], [29] have
proposed a universal arithmetic building element (ABE) to be used in
a combinatorial logic net to perform arithmetic functions. Pisterzi [30]
has utilized the explicit signed-digit representation to design a limited
connection arithmetic unit with a central global control which provides
the temporally sequential commands to the various modules to achieve
the arithmetic functions.

A negative base number system, while facilitating the addition/
subtraction and multiplication processes at a digit level, makes the
division process very complicated. In any restoring or non-restoring
division algorithm [31], [32], [33], the signs of the partial remainder and
the divisor are essential, and the negative base number representation
does not lend itself to easy determination of the sign of an operand because
one has to go through a counting process to know whether the most
significant digit of the integer representation is in an even position
or an odd position. Further, for faster addition/subtraction, one
still needs carry look-ahead circuits.
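
The sign determination problem in a negative base system can be seen in a small Python sketch for radix -2. The function names are hypothetical, and digit positions are indexed from 0 at the least significant end, so index 0 corresponds to the first (odd, positive weight) position in the text's numbering.

```python
def to_negabinary(value):
    """Return the radix -2 digits of an integer, least significant first."""
    if value == 0:
        return [0]
    digits = []
    while value != 0:
        value, rem = divmod(value, -2)
        if rem < 0:        # force the digit into {0, 1}
            rem += 2
            value += 1
        digits.append(rem)
    return digits


def from_negabinary(digits):
    """Evaluate least-significant-first radix -2 digits."""
    return sum(d * (-2) ** i for i, d in enumerate(digits))


def sign_by_position(digits):
    """Sign of the number, found by locating the leading nonzero digit.

    The scan that finds the index of that digit is the counting process:
    an even index has a positive weight, an odd index a negative weight.
    """
    for i in range(len(digits) - 1, -1, -1):
        if digits[i] != 0:
            return 1 if i % 2 == 0 else -1
    return 0
```

No single digit carries the sign: the digit strings for 2 and -1 are `[0, 1, 1]` and `[1, 1]`, and only the position of the leading 1 distinguishes a positive value from a negative one.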

Avizienis' proposed number representation, besides being signed-digit,
is also redundant; i.e., each digit position can take more
than r values, where r is the radix of the number representation. This
number system has many desirable features, namely:

i)   The algebraic value Z of a number composed of n + m + 1
digits $(z_{-n} \ldots z_{-1} z_0 \, . \, z_1 \ldots z_m)$ is given by the
expression:

$$Z = \sum_{i=-n}^{m} z_i r^{-i}$$

ii)  The algebraic value Z = 0 if and only if all $z_i = 0$.
iii) The sign of the algebraic value Z is given by the sign of
the most significant (left-most) nonzero digit.
iv)  To form the representation of the additive inverse -Z, the
sign of every nonzero digit $z_i$ is changed individually.
v)   The addition and subtraction of two signed-digit operands
Z and Y satisfy $s_i = f(z_i, y_i, z_{i+1}, y_{i+1})$ for all
positions i, where the $s_i$ are the digits in the representation of the
sum or difference S = Z ± Y. This means that there are no
carry propagation chains in signed-digit additions (or
subtractions).
vi)  The same logic that is used for adding two numbers (maximally
redundant) can be used to convert a number from
conventional binary representation to the signed-digit
format.
vii) It allows limited inspection of partial remainder digits
to determine the quotient.
The  properties  (iv)  to  (vi)  make  the  signed-digit  redundant  number 
representation  very  suitable  for  digit-wise  operation  of  the  arithmetic 
unit.   Property  (iii)  obviates  the  need  for  any  complement  arithmetic 
operations. 
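
The sign, negation and carry-free addition properties listed above can be tried out with a short Python sketch. It assumes radix 10 with digits in the range -6 to 6 (values chosen here only for readability), stores digits most significant first, and uses a common two-step formulation with a transfer digit and an interim sum; the function names are hypothetical and no claim is made that this matches the logic of any cited design.

```python
R = 10  # radix, assumed for illustration
A = 6   # largest digit magnitude; needs (R + 1) / 2 <= A <= R - 1


def sd_value(digits, r=R):
    """Integer value of a most-significant-first signed-digit list."""
    value = 0
    for d in digits:
        value = value * r + d
    return value


def sd_sign(digits):
    """Sign taken from the most significant nonzero digit."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0


def sd_negate(digits):
    """Additive inverse: change the sign of every digit individually."""
    return [-d for d in digits]


def sd_add(z, y, r=R, a=A):
    """Add two equal-length signed-digit numbers without carry chains."""
    n = len(z)
    transfer = [0] * n
    interim = [0] * n
    for i in range(n):                 # step 1: every position independently
        p = z[i] + y[i]
        if p >= a:
            transfer[i], interim[i] = 1, p - r
        elif p <= -a:
            transfer[i], interim[i] = -1, p + r
        else:
            interim[i] = p
    result = [transfer[0]]             # transfer out of the leading position
    for i in range(n):                 # step 2: absorb the neighbor's transfer
        incoming = transfer[i + 1] if i + 1 < n else 0
        result.append(interim[i] + incoming)
    return result
```

Each sum digit depends only on the operand digits at its own position and at the adjacent less significant position, so the addition time in this sketch does not grow with the number of digits. Subtraction can be formed as `sd_add(z, sd_negate(y))`.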

Based on the above number representation, Avizienis [28] proposed his
Arithmetic Building Element (ABE), which is capable of adding two
digits of two operands, forming the product of two digits,
and forming a sum of many digits, one from each of several operands,
besides being capable of performing logical operations on individual
digits. He proposed this element for use in combinational
arrays for forming the product of two numbers. But since the ABE can
form a sum of only m ≤ r+1 digits, it becomes necessary to partition
the product of two numbers, where the multiplier is more than r+1
digits long, into groups of r+1 digits so that the same kind of ABE can
be used to form the whole product. Secondly, the proposed ABE is too
complex for any reasonable radix r. Thirdly, the combinational net
for the division process is very complex and expensive.

For digit-wise arithmetic operations, mention should also be made
of the residue number representation [34], which allows addition/subtraction
and multiplication on a digit basis. However, handling of overflow/
underflow, conversion from conventional binary representation to residue
number representation, and, of course, the division process are very
complicated, which is why not many computers have been built based
on this number representation. Moreover, because the modulus for each
digit position is different, all the digit modules are necessarily
different and are not compatible with the LSI constraint of few module types.
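
A brief Python sketch shows both the attraction and the drawbacks of the residue representation. The moduli (3, 5, 7) and the function names are assumptions chosen for illustration; the reconstruction step uses the Chinese remainder theorem and relies on `math.prod` and the three-argument `pow` with exponent -1, both available in Python 3.8 and later.

```python
from math import prod

MODULI = (3, 5, 7)  # assumed pairwise-coprime moduli; range is 3 * 5 * 7 = 105


def to_residue(x, moduli=MODULI):
    """Convert a nonnegative integer to its residue digits."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Digit-wise addition: no information passes between digit positions."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Digit-wise multiplication, again with no inter-digit carries."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_residue(res, moduli=MODULI):
    """Convert back to an integer by Chinese remainder reconstruction."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(res, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse
    return total % big_m
```

Addition and multiplication are purely digit-wise, but each digit position works modulo a different number, the conversion back needs a separate and more involved computation, and a result outside the range 0 to 104 wraps around without any digit signalling the overflow.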

1.3  Present  Work 

The goal of the present work is to formulate a set of desirable
characteristics for an LSI implementable Arithmetic Unit capable of the four
basic operations of addition, subtraction, multiplication and division;
to choose a suitable system and logical organization which comes close
to meeting these desirable properties; and finally to study the arithmetic
and logic design of the arithmetic unit.

From our discussion in Sections 1.1 and 1.2, the following set of
characteristics for the Arithmetic Unit is considered suitable for its
implementation in LSI or any batch fabrication process technology.

i)   The arithmetic unit should be partitionable on a bit slice
or digit slice (for higher radix) basis, which means that we
should be able to perform calculations on a digit-by-digit
basis. All the digit processing modules should be identical
so that a variable word length can be accommodated.
ii)  Purely combinational cellular arrays are too expensive for
large operand lengths, especially when each cell is rather
complex. Hence, arithmetic functions should be
executed as a time sequence of microinstructions. Further,
to achieve a balance between the high cost of a purely
combinatorial array and the slow speed of completely sequential
execution of microinstructions, some form of pipeline
structure should be employed so that when an arithmetic expression
is evaluated, the various arithmetic operations can be
overlapped.
iii) To avoid fan-out problems in the case of large operand lengths,
the various modules should have limited intercommunication
with each other.
iv)  Each processing module should have local control and be
autonomous as far as possible, so that only a few
microinstructions need to be issued by a central control to
the modules, instead of a large number of separate control
signals. This would cut down the number of external leads
necessary on each module.
v)   The various microinstructions should be as simple as possible.
vi)  Each processing module must be consistent with the constraints
of large scale integration insofar as the total external pin count
of the module is concerned, and the module itself should
preferably be made up of cells (identical logic repeated) when the
cell consists of many gates.
vii) Since the divide process by its very nature has to
examine the most significant digits of the operands (dividend/
partial remainder, and divisor) for the calculation of the
quotient, multiplication and addition/subtraction should
also be performed as right-directed processes.
This most significant digit first approach is consistent with the other
arithmetic processes of operand normalization, mantissa overflow determination
and determination of the sign of the result, because these processes
inherently require the examination of the most significant digits of the
operands to determine what additional processing is necessary.

Many of the characteristics mentioned above are met by an Arithmetic
Unit structure proposed by Pisterzi [30]. The Arithmetic Unit consists of
modular processing elements called the Digit Processing Units (DPUs)† and
a global control module called the Primitive Control Unit (PCU). The PCU
does not broadcast control signals to all the DPUs; instead, it
communicates only with the most significant DPU as far as the issuance
of microinstructions is concerned. The first DPU executes each
instruction and then passes it on to the second DPU, which again executes this
microinstruction and passes it down to the next DPU, and so on.
Thus, a sort of pipeline of microinstructions is established in which the
same sequence of microinstructions is executed in each DPU.

† DPU and PCU are the terms used by Pisterzi [30]. In the present
thesis, the terms PE and MCU will be used for the Processing Element and the
global control module respectively.

A  simplified  block  diagram  of  such  an  arithmetic  unit  is  shown  in 
Figure  1.1. 


[Figure 1.1: block diagram. The PCU is connected to DPU_1, which heads a
linear cascade DPU_1, DPU_2, ..., DPU_{n-1}, DPU_n.]

Figure 1.1  Block Diagram of a Basic Model of Limited
Connection Arithmetic Unit

The present study concentrates on the design of the essential
microinstructions necessary for performing the four basic arithmetic
operations in such a structure, the logic design of the Processing Element,
and the method of communication between the Processing Elements and the
Data Main Memory for fetching and storing operands and results. The major
part of this thesis reports on the logic design of the Processing Element
and identifies those parts of the Processing Element whose gate and pin
complexity are a function of the bit width of the Processing Element.
This, in turn, allows us to choose a suitable bit width for the processing
module consistent with the technology constraints and also to balance the
costs of the processing logic and the control logic of the Processing
Element.

Chapter 2 describes briefly the system and logical organization and
mode of operation of the Arithmetic Unit. The major emphasis in this
chapter is on the logical structure of the Mantissa Processing Logic
(MPL), the method of communication between the modules of the MPL and the
flow of microinstructions through them. This discussion provides the
necessary perspective for the material in later chapters. The flow of
microinstructions in the MPL is illustrated by a generalized example.
The chapter concludes with the definition and a brief description of a
set of basic and elementary microinstructions which are sufficient to
execute 'machine' arithmetic instructions, such as Add and Multiply, on
two operands.

Chapter 3 treats the arithmetic design of the Processing Element,
the basic module of the Mantissa Processing Logic. The arithmetic design
is described in terms of the implications of the particular structure of
the Mantissa Processing Logic on the required characteristics of the
number system, the number representation and the definition of a
normalized number. Finally, in Section 3.6, which is the major portion of
this chapter, we develop the definition and operational specification of a
set of five simple arithmetic microinstructions. These microinstructions
cause an arithmetic transformation of the data and are specified as such by
an arithmetic transfer function, wherever possible. The digit algorithm
for each arithmetic microinstruction is also given.

In Chapter 4, which is the largest chapter of this thesis, the logic
design of the major components of the Processing Element is given. The
major components are the register file for storage of active operands,
the Combinational Network for processing and the Control which generates
control signals to condition the Combinational Network. This chapter
also describes the actual format and code assignment for the twelve types
of microinstructions executed in a Processing Element. Finally, the
logic complexity of the Processing Element is calculated in terms of the
total number of gates and external leads required in the Processing
Element module as a function of the bit width of the module and the
redundancy ratio of the multiplier and quotient digit.

Chapter 5 describes how the Mantissa Processing Logic and the Data
Main Memory may communicate to fetch and store operands through an
interface whose behavior is somewhat analogous to that of a cache memory.

In  Chapter  6,  we  show  how  the  various  microinstructions  can  be 
combined  into  a  sequence  to  be  executed  by  the  Processing  Element  modules 
to  perform  a  'machine'  arithmetic  instruction  like  Floating  Point  Add, 
Multiply,  etc. 

Summary  and  conclusions  are  given  in  Chapter  7. 

Two appendices are included. Appendix A-1 gives the algebraic design
of a digit recoder which changes the redundancy ratio of the digit from
unity to ≤ 2/3. In Appendix A-2, we calculate the number of radix-2
digits of the truncated operands that are necessary in the model division
to determine one radix-2 quotient digit.

2.   ORGANIZATION AND OPERATION OF ARITHMETIC UNIT

2.1  Introduction 

In  order  to  put  the  discussion  in  the  following  chapters  in  proper 
perspective,  a  brief  description  of  the  logical  organization  and  method 
of  performing  the  processing  is  given.   The  method  of  processing  is 
illustrated  by  an  idealized  example  in  Section  2.4.   The  chapter  closes 
with  an  introductory  description  of  the  repertoire  of  only  the  essential 
microinstructions  which  are  executed  by  the  processing  logic. 

2.2  Organization  of  the  Arithmetic  Unit 

Figure 2.1 shows the global block diagram of the arithmetic unit. It
consists of the Mantissa Processing Logic, the Exponent Processing
Logic, the Local Operand Memories (LOMM, LOEM) and an Arithmetic Control
Unit. The Arithmetic Control Unit (ACU) consists of three parts: the
Global Arithmetic Control Unit (GACU), the Mantissa Control Unit (MCU)
and an Exponent Control Unit (ECU).

The GACU acts as the interface between the arithmetic unit and
the rest of the computer. It receives the arithmetic instructions from
the central control of the computer, decodes them, and causes the Local
Operand Memory Control (LOMCO) to fetch the necessary operands from main
memory, if they are not already present in LOMM. LOMCO provides the
LOMM address of the operands to the GACU, which then issues the necessary
commands to the ECU and MCU for exponent and mantissa processing and
coordinates their actions. After the processing is complete, it informs
the central control of its status, along with any exceptional conditions
that may have arisen during execution of the instruction.

The MCU converts the commands received from the GACU into the necessary
microinstructions to be executed by the Mantissa Processing Logic. For
example, the Multiply command is converted into a series of shift left
multiplier, form multiple and add, and shift left accumulator
microinstructions. The MCU also contains the overflow recoder logic, the
quotient determination logic, etc.
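
The expansion of a Multiply command can be pictured with a short sketch. It is only an illustration of the three steps named above, repeated once per multiplier digit. The function name `expand_multiply`, the register names `MQ` and `ACC`, and the ordering of the three steps within one iteration are assumptions made for this example; the thesis places the detailed design of the MCU outside its scope.

```python
from typing import Iterator, Tuple


def expand_multiply(num_digits: int) -> Iterator[Tuple[str, str]]:
    """Yield (mnemonic, register) pairs for a hypothetical Multiply expansion.

    Register names and per-digit ordering are illustrative assumptions.
    """
    for _ in range(num_digits):
        yield ("LS", "MQ")    # shift left multiplier
        yield ("FMA", "ACC")  # form multiple and add into the accumulator
        yield ("LS", "ACC")   # shift left accumulator


sequence = list(expand_multiply(2))
for mnemonic, register in sequence:
    print(mnemonic, register)
```

Writing the expansion as a generator mirrors the way the MCU issues one microinstruction at a time to the most significant PE.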

The ECU performs the necessary control for exponent arithmetic, such as
calculating the difference of the exponents for the addition and
subtraction instructions and the sum of the exponents for the
multiplication instruction, and detecting exponent overflow and underflow
conditions.

In this thesis, we shall be concerned mainly with the detailed
design of the Mantissa Processing Logic and its communication with the
Local Operand Mantissa Memories. The detailed design of the GACU, MCU and
ECU is beyond the scope of this research. The next section describes
the logical organization of the Mantissa Processing Logic and its method
of processing.

2.3  Organization  of  Mantissa  Processing  Logic 

The Mantissa Processing Logic consists of a linear cascade of
identical Processing Elements (PEs). Each PE is a complex logical
module and contains logic to perform the various microinstructions,
issued by the Mantissa Control Unit (MCU), in cooperation with other PEs.
The MCU communicates only with the most significant PE (closest to the
MCU) and the microinstructions flow serially (in a pipelined manner)
from the most significant PE to the least significant PE.

Figure 2.2 shows the schematic organization of the Mantissa Processing
Logic along with the MCU. This figure also shows an End Unit which
is optional and not intrinsically necessary for the arithmetic
processing. The End Unit allows the last PE to be identical to all the
other PEs as far as the interface is concerned, thus causing it to operate
as though it had another PE to its right. Moreover, it could contain some
logic in which the operand digits shifted off the right end could be
temporarily stored for improving the accuracy of the result [35].

The PEs collectively contain the fractional (mantissa) parts of
all active operands, one digit in each PE, as shown in Figure 2.3.
Because the quotient generation and operand normalization processes
require the examination of the most significant digits, the operands are
placed in the PEs so that the digits of each of the operands are
available to the microinstructions in order of decreasing significance.
Thus, the most significant digits of the active operands are placed in the
PE which communicates with the MCU.

Each PE performs the same sequence of microinstructions. A given
microinstruction is not executed by all PEs in synchro-parallelism but
rather must be executed by them in sequence (i.e., first by PE$_1$, then
PE$_2$, ...). Note that this is different from a conventional pipeline
organization in which data flows in sequence through a number of stages
which, in general, perform different operations on the data. In this
organization, however, data is relatively constant and flowing
microinstructions tell a PE what operation to execute on the data resident
in that PE at that instant of time.

[Figure 2.3: distribution of the operand digits among the PEs; r is the
radix.]

Figure 2.3  The Distribution of Operand Digits in the PEs of
Mantissa Processing Logic.

During processing, each PE physically communicates only with its
immediate neighbors. To execute a microinstruction, a given PE may need
information from its right neighbor. This information logically may
depend on the contents (active operand digits) of its neighboring PEs,
depending on the nature of the microinstruction. So we may say that
each PE physically communicates with only one PE to its immediate right
but, from a logical viewpoint, the PE communicates with more than one
PE.† In the following discussion of the mode of processing, when we
talk about information required by a PE from its right neighbors, we
mean the information requirement in the logical sense.

As mentioned earlier, a given microinstruction is executed by the PEs
not in synchro-parallelism but rather in sequence. As soon as all the
PEs (say $\alpha$ of them, where $\alpha$ depends on the microinstruction)
which contain information required by PE$_i$ to perform microinstruction
j+1 (referred to as $\mu_{j+1}$) have executed $\mu_j$ and have sent the
required information to PE$_i$, $\mu_{j+1}$ may be performed by PE$_i$. The
microinstructions executed by the PEs are defined in such a way that they
have regular data requirements independent of the position of the PE in
which a microinstruction is executed, so that as each additional PE
executes $\mu_j$, one more PE may execute $\mu_{j+1}$.†† The
microinstructions may be viewed as flowing through successive PEs. Clearly,
the PE registers do not contain entire operands as long as any of the PEs
are actively executing microinstructions. Each PE contains the digits from
the results of the last microinstruction executed. (In the worst case, if
there are n PEs and each PE has the capability of storing n active
operands, there could be n active operands in different stages of
processing if there is a sequence of n load or store microinstructions.)

† The logical communication could be converted into physical
communication by duplicating the necessary hardware logic in the PE where
that information is required, but this would increase the number of
interconnections.

†† There is an exception to this rule in the case of the Assimilation
Recode (AR) microinstruction, in which case $\alpha_j$ is variable and
depends on the nature of the data resident in the PEs. This is further
explained later in Section 4.3.2.3.3.

2.4  Formal  Description  of  Processing  in  a  PE 

The processing performed by the PEs can be described by the following:

Let

    ${}_jX_i = \psi_j({}_{j-1}X_i,\ {}_jF_{i-1},\ {}_jG_{i+1})$          (2.1)

    ${}_jF_i = \phi_j({}_{j-1}X_i,\ {}_jF_{i-1})$    and                 (2.2)

    ${}_jG_k = \Gamma_j({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_{j-1}X_{k+1},\ \ldots,\ {}_{j-1}X_{k+\alpha_j-1})$    (2.3)

where

${}_jX_i$    is the operand information contained in the i-th PE
             immediately following the execution of $\mu_j$. It consists
             of the i-th digit of each of the active operands,

$\mu_j$      represents the j-th microinstruction,

$\psi_j$     is the function employed to obtain the new operand set and
             is dependent on the microinstruction to be performed,

${}_jF_i$    is a 'modifier' value which PE$_i$ transmits to PE$_{i+1}$
             with the microinstruction j, to be performed next,

$\phi_j$     is the function which each PE performs to determine
             ${}_jF_i$,

$\Gamma_j$   is the function PE$_k$ employs to determine ${}_jG_k$,

${}_jG_k$    is the value which PE$_k$ transmits to the PE executing
             $\mu_j$, and

$\alpha_j$   is one more than the number of PEs which must logically
             cooperate with the right neighbor of the PE performing
             $\mu_j$ in order to generate the necessary ${}_jG_k$.

The information ${}_jG_k$ is generated in a time sequential fashion.
${}_jG_k$ consists of $\alpha_j$ components ${}_jG_k^0$, ${}_jG_k^1$, ...,
${}_jG_k^{\alpha_j-1}$ and they are given by the following relations.

    ${}_jG_k^0 = \Gamma_j^0({}_jF_{k-1},\ {}_{j-1}X_k)$

    ${}_jG_k^1 = \Gamma_j^1({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}^0)$
        .
        .                                                             (2.4)
        .
    ${}_jG_k^{\alpha_j-1} = \Gamma_j^{\alpha_j-1}({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}^0,\ \ldots,\ {}_jG_{k+1}^{\alpha_j-2})$

    ${}_jG_k = \{\,{}_jG_k^0,\ {}_jG_k^1,\ \ldots,\ {}_jG_k^{\alpha_j-1}\,\}$

The superscript on ${}_jG_k$ indicates the time order of sequential
generation of G-information.
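
As a worked instance of the relations (2.4), take $\alpha_j = 2$, the value that the generalized example of Section 2.5 assigns to its first microinstruction. This is only a specialization of the general relations, written under the assumption that the argument lists are as defined for (2.1) through (2.3); the functions $\Gamma_j^0$ and $\Gamma_j^1$ themselves remain unspecified.

```latex
\begin{align*}
{}_jG_k^{0} &= \Gamma_j^{0}\bigl({}_jF_{k-1},\ {}_{j-1}X_k\bigr),\\
{}_jG_k^{1} &= \Gamma_j^{1}\bigl({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}^{0}\bigr),\\
{}_jG_k     &= \bigl\{\,{}_jG_k^{0},\ {}_jG_k^{1}\,\bigr\}.
\end{align*}
```

PE$_k$ can therefore form its first component as soon as the microinstruction reaches it, but it must wait for one value to come back from PE$_{k+1}$ before its second component, and hence ${}_jG_k$, is complete.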

Another formulation, which is applicable only for a fixed value of
$\alpha_j$, is given by Pisterzi [30]. In this formulation, the PE
executing microinstruction $\mu_j$ gets G-information directly from the
$\alpha_j$ PEs to its immediate right. The trade-off between the two is
that the former (the time sequential formulation given here) needs fewer
connections to PE$_{i+1}$ and also less logic in PE$_i$, since the
G-information is developed in a distributed fashion in the $\alpha_j$ PEs.
However, this is obtained at the expense of more complex control and a
longer time delay.

The operation of a typical PE, PE$_i$ say, is as follows. It begins
in a state in which it is receptive to information defining the next
microinstruction to be performed. PE$_i$ receives this information
(microinstruction) and the value of ${}_jF_{i-1}$ from its left neighbor
PE$_{i-1}$. Then PE$_i$ determines ${}_jG_i$, the information required by
PE$_{i-1}$ to complete microinstruction $\mu_j$. ${}_jG_i$ is determined
sequentially as described by the set of relations in (2.4). The component
${}_jG_i^0$ is developed immediately. At the same time, PE$_i$ determines
${}_jF_i$ by performing equation (2.2) (which incidentally is the same as
${}_jF_{i-1}$ in most cases) and transmits the identity of $\mu_j$ along
with ${}_jF_i$ to PE$_{i+1}$. At this time, PE$_{i+1}$ generates
${}_jG_{i+1}^0$ and transmits it back to PE$_i$ so that PE$_i$ may generate
${}_jG_i^1$. Simultaneously, it (PE$_{i+1}$) transmits the identity of the
$\mu_j$ instruction along with ${}_jF_{i+1}$ to PE$_{i+2}$, which repeats
the same process. Note that the information ${}_jG_i^{\alpha_j-1}$ depends
on ${}_jG_{i+\alpha_j-1}^0$, which must trickle back to PE$_i$. Although
this takes quite some time, ${}_jG_{i+1}^{\alpha_j-1}$ can be generated by
PE$_{i+1}$ just one time step later. The initial setup time is large,
however.

As soon as PE$_i$ transmits ${}_jG_i^{\alpha_j-1}$ to PE$_{i-1}$,
PE$_{i-1}$ can complete the execution of microinstruction $\mu_j$. After
some time, PE$_i$ receives a signal from PE$_{i-1}$ indicating that
PE$_{i-1}$ has executed $\mu_j$. PE$_i$ then executes $\mu_j$ (the
necessary ${}_jG_{i+1}^{\alpha_j-1}$ being ready by now). PE$_i$ now
transmits a signal to PE$_{i+1}$ which indicates that PE$_{i+1}$ may
execute $\mu_j$. When PE$_i$ receives an acknowledgement from PE$_{i+1}$,
it goes into a state where it is receptive to information concerning
$\mu_{j+1}$. The sequence above then repeats.

2.5  Generalized Example

To illustrate how the processing of several microinstructions may
take place concurrently in the Mantissa Processing Logic, each by a
different PE, we describe below a generalized example. This example is
borrowed from Pisterzi [30] but the necessary changes have been made to
conform to our notation.

Table 2.1 shows the $\alpha_j$ for the various microinstructions for the
generalized example. The Mantissa Processing Logic will have five PEs
and one operand. This operand will be indicated as composed of five
digits $a_1, \ldots, a_5$ such that digit ${}_ja_i$ is the digit contained
in PE$_i$ after the j-th microinstruction.

                          Table 2.1
    $\alpha_j$ of the Microinstructions of the Example of Figure 2.4.

        j             1     2     3     4     5     6
        $\alpha_j$    2     1     0     1     2     0

The operation of the Mantissa Processing Logic is presented in a
tabular form in Figure 2.4. The columns labeled O$_i$ will indicate the
operand contained in PE$_i$. The occurrence of ${}_ja_i$ in the i-th
operand column will indicate that ${}_ja_i$ has just been computed and
placed in the operand register of PE$_i$. The columns labeled IR$_i$ will
indicate the microinstruction being executed by PE$_i$ and/or the
G-information being produced by PE$_i$. The occurrence of $\mu_j$ in the
IR$_i$ column will be used to denote that PE$_i$ has just received the
identity of the j-th microinstruction and will begin determining
${}_jG_i$ in a time sequential fashion. The appearance of
${}_jG_i^{\lambda}$ in the IR$_i$ column indicates that PE$_i$ has just
determined the $\lambda$-th component of the G-information which is needed
by PE$_{i-1}$. $\lambda$ ranges from 0 to $\alpha_j - 1$. (In our example,
$0 \le \lambda \le 1$.) The occurrence of $(\mu_j)$ will represent that the
execution of microinstruction $\mu_j$ has just been completed by PE$_i$
(and the result operand digit ${}_ja_i$ has been generated, as indicated
by the appearance of ${}_ja_i$ in column O$_i$). The progression of time
will be indicated by the rows, each row equivalent to the time required by
a PE to execute one step of processing.

Figure 2.4  Illustration of the Execution of the
Generalized Example in Mantissa
Processing Logic.

Figure 2.4 shows the Mantissa Processing Logic in steady state at
time 0. No microinstructions are being executed and the operand $A_0$
(${}_0a_1$, ${}_0a_2$, ${}_0a_3$, ${}_0a_4$, ${}_0a_5$) is in the operand
register. The processing proceeds as follows.

We assume that at time 3, the identity $\mu_1$ of microinstruction 1 has
reached PE$_3$. PE$_3$ calculates ${}_1G_3^0$ and sends it to PE$_2$
(Figure 2.5a). At time 4, PE$_2$ calculates ${}_1G_2^1$ and sends it to
PE$_1$. At time 5, all the G-information required for execution of $\mu_1$
($\alpha_1 = 2$) is available in PE$_1$ and $\mu_1$ is executed by PE$_1$.
This causes ${}_0a_1$ to be replaced by ${}_1a_1$. During the next four
time intervals, $\mu_1$ is performed consecutively by each of the
remaining PEs since ${}_1G_k$ becomes available just as it is required by
PE$_{k-1}$ to perform $\mu_1$. The identity $\mu_2$ of the second
microinstruction is received by a PE one time unit after that PE performs
$\mu_1$. Since PE$_1$ requires ${}_2G_2$ to execute $\mu_2$
($\alpha_2 = 1$), this microinstruction is not performed by PE$_1$ until
time 8, one time step after PE$_2$ is able to determine this value and
send it to PE$_1$ (Figure 2.5b). Just as with $\mu_1$, $\mu_2$ is executed
sequentially by each of the remaining PEs during each of the next four
time intervals. Microinstruction $\mu_3$ is performed by each of the PEs
one time unit after each PE has performed $\mu_2$ because $\alpha_3 = 0$
and no outside information is required. The other microinstructions are
performed in the same pattern.

In general, PE$_i$ performs $\mu_j$ $2\alpha_j + 1$ time units after its
execution of $\mu_{j-1}$. The time $T_{Em}$ elapsed between the instant
when the identity of the first microinstruction reaches PE$_1$ and the
instant of execution of the m-th microinstruction (of a set of
consecutively issued microinstructions) by the first PE is given by

    $T_{Em} = \sum_{j=1}^{m} 2\alpha_j + m$.
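
The timing rule can be checked numerically against the $\alpha_j$ values of Table 2.1. The sketch below is an illustration only: the function names `execution_times` and `schedule` are invented here, and the per-PE schedule additionally assumes, as the example describes for $\mu_1$ and $\mu_2$, that each PE executes a microinstruction one time unit after its left neighbor.

```python
from itertools import accumulate
from typing import List


def execution_times(alphas: List[int]) -> List[int]:
    """Return T_Em for m = 1..len(alphas), where T_Em = sum_{j<=m} 2*alpha_j + m."""
    # Each microinstruction adds 2*alpha_j + 1 time units at the first PE.
    return list(accumulate(2 * a + 1 for a in alphas))


def schedule(alphas: List[int], num_pes: int) -> List[List[int]]:
    """schedule[m][i] is the time at which PE_(i+1) executes microinstruction m+1.

    Assumption: every PE lags its left neighbor by exactly one time unit.
    """
    return [[t + i for i in range(num_pes)] for t in execution_times(alphas)]


TABLE_2_1 = [2, 1, 0, 1, 2, 0]  # alpha_j for j = 1..6
times = execution_times(TABLE_2_1)
print(times)                       # times at which the first PE executes each one
print(schedule(TABLE_2_1, 5)[0])   # when each of the five PEs executes the first
```

The first three values, 5, 8 and 9, agree with the times quoted in the generalized example for the execution of the first three microinstructions by the first PE.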


2.6  The Microinstruction Repertoire of the PEs

In this section, we will discuss briefly the microinstructions
which are executed by the PEs so that the overall arithmetic unit is
able to do addition, subtraction, multiplication, division and
normalization. The microinstructions may be broadly categorized into five
classes for the purposes of this discussion. These five classes are:

1.  the  inter-register  transfers, 

2.  the  shift  microinstructions, 

3.  the  arithmetic  microinstructions, 

4.  the  memory  accessing  microinstructions,  and 

5.  the  miscellaneous  microinstructions. 

2.6.1  The inter-register transfer microinstructions - These
microinstructions cause operands to be transferred from one internal
register of the PE to another internal register. There are two
instructions in this class: Transfer Direct (TD) and Transfer Invert (TI).
The microinstruction TD moves the contents of one register in the PE to
another register, both the registers being specified explicitly in the
microinstruction, with no changes in the source operand. The
microinstruction TI, on the other hand, causes the transfer of operands
from the source to the destination register, with the sign of the source
operand being inverted, that is, changed to the opposite polarity.

The microinstruction TD allows the results of one instruction to be
stored temporarily in another local register before being used as an
operand in the execution of some later microinstruction, thus avoiding
a memory reference. A second application of this microinstruction is in
the exchange of operands when normalization is required. As will be
seen later, because the Normalization Recode (NR) and Assimilation
Recode (AR) microinstructions require the operand to be in the
Accumulator register only, assimilation and normalization of operands
require the use of microinstruction TD for moving the operand to the
Accumulator register.

The main use of the microinstruction TI occurs when one needs to
change the sign of an operand before it is used, e.g., in the case of
subtraction. Since the PE has only an 'add' microinstruction, it is
necessary to invert the sign of the operand before it is 'added' to
another operand to cause subtraction. Note that, in this microinstruction,
the source and destination register addresses can be the same. This
microinstruction can thus be used, if necessary, for getting the
absolute value of an operand.

In all the inter-register transfers, all of the data required by a
PE to perform the microinstruction is contained within that PE itself.
As can be seen in Figure 2.3, each PE contains one digit of each of the
operands. Therefore the value of $\alpha$, the number of PEs which must
logically cooperate with the PE executing the inter-register transfer
microinstruction, is zero, and ${}_jF_i$ is not required to transmit data.
The value of ${}_jF_i$ is used instead to identify both the registers
taking part in the transfer. The exact format of the microinstructions TD
and TI is discussed in Section 4.3.2.2.

In the notation of Section 2.4, the inter-register transfer
microinstructions may be expressed as:

    ${}_jx_i = {}_{j-1}y_i$        i = 1, 2, ..., n        (2.5)

    ${}_jF_i = {}_jF_{i-1}$        i = 1, 2, ..., n        (2.6)

    ${}_jG_i$ = <null>             i = 1, 2, ..., n        (2.7)

where

Y                is the register to be copied into the X register,
${}_jx_i$        is the i-th digit of the X register after the transfer,
${}_{j-1}y_i$    is the i-th digit of the Y register before the transfer,
                 and
<null>           indicates that the value of ${}_jG_i$ is not required
                 when performing inter-register transfers.
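
The purely local character of the transfers can be illustrated with a small model in which each PE is a dictionary holding one digit per register. This is a sketch under stated assumptions, not the logic design of Chapter 4: the register names `ACC` and `B` and the function `transfer` are hypothetical, and TI is modelled by negating each digit, which assumes a signed-digit representation in which digit-wise negation changes the sign of the whole operand.

```python
from typing import Dict, List

PE = Dict[str, int]  # one PE: register name -> the single digit it holds


def transfer(pes: List[PE], dst: str, src: str, invert: bool = False) -> None:
    """Apply TD (invert=False) or TI (invert=True) in every PE, left to right."""
    for pe in pes:              # the microinstruction visits PE_1, PE_2, ... in turn
        digit = pe[src]         # alpha = 0: only data local to this PE is needed
        pe[dst] = -digit if invert else digit


pes = [{"ACC": 0, "B": d} for d in (1, -2, 3)]
transfer(pes, dst="ACC", src="B")             # TD: ACC <- B
transfer(pes, dst="B", src="B", invert=True)  # TI with source == destination
print([pe["ACC"] for pe in pes], [pe["B"] for pe in pes])
```

No value is passed between PEs inside the loop, which corresponds to ${}_jG_i$ being null for this class of microinstructions.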

2.6.2  The shift microinstructions - These microinstructions are
used during radix point alignment prior to addition or subtraction, for
normalization, and for multiplication and division by the radix during
the repetitive steps of multiplication and division. A shift of more
than one digit position is performed as a number of successive shifts
of one digit position each.

The left shift can be accomplished by causing the PE to the
immediate right of the PE performing the microinstruction to transmit the
value of the digit of the operand contained in its register to the PE
performing the microinstruction. This PE stores the digit it receives
in its operand register. The equations defining the left shift
microinstruction, LS, are:

    ${}_jx_i = {}_jG_{i+1}$        i = 1, 2, ..., n        (2.8)

    ${}_jF_i = {}_jF_{i-1}$        i = 1, 2, ..., n        (2.9)

    ${}_jG_i = {}_{j-1}x_i$        i = 1, 2, ..., n        (2.10)

    ${}_jG_{n+1} = {}_jF_n$        if ${}_jF_n$ is a valid digit;
                                   otherwise see text      (2.11)

where

X                is the operand being shifted,

${}_jx_i$        is the i-th digit of the shifted operand,

${}_{j-1}x_i$    is the i-th digit of X before the shift,

${}_jF_i$        is the modifier value passed along with the
                 microinstruction; it carries the address of the register
                 to be shifted and the value of a digit sent by the MCU
                 that is to go into the last PE. This is made use of in
                 the execution of Multiplication.

${}_jF_0$        is the value that the MCU sends to PE$_1$ with the left
                 shift microinstruction to indicate the value that is to
                 go into the last PE.

If ${}_jF_0$ is a valid digit, it becomes the digit shifted into the last
PE. If it is not a valid digit, it causes the End Unit to shift in the
digit shifted out during the last right shift.

One  should  also  note  that  the  left  shift  microinstructions  make  it 
possible  to  transmit  the  most  significant  digit  of  an  operand  to  the 
MCU.   The  left  shift  can  therefore  be  used  by  the  MCU  to  examine 
operands. 
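
A single left shift step can be sketched as a list operation, with index 0 standing for the most significant PE. The function `left_shift`, the use of `None` to stand for "not a valid digit", and the list `end_stack` modelling the End Unit's store are all assumptions made for this illustration.

```python
from typing import List, Optional, Tuple


def left_shift(digits: List[int], fill: Optional[int],
               end_stack: List[int]) -> Tuple[List[int], int]:
    """One LS step. Returns (new digits, digit delivered to the MCU).

    fill=None models an invalid modifier digit: the End Unit then supplies
    the digit it saved during the last right shift (top of end_stack).
    """
    to_mcu = digits[0]  # the leading digit leaves PE_1 toward the MCU
    incoming = fill if fill is not None else end_stack.pop()
    return digits[1:] + [incoming], to_mcu  # each PE takes its right neighbor's digit


shifted, msd = left_shift([3, 1, 4], fill=0, end_stack=[])
restored, _ = left_shift([0, 3, 1], fill=None, end_stack=[4])
print(shifted, msd, restored)
```

In this model an invalid fill digit with an empty `end_stack` raises `IndexError`; what the End Unit does in that situation is not stated in the text.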

The right shift (RS) microinstruction does not have the complexity
of the left shift microinstruction. The value stored into a PE is the
value transmitted by its left neighbor PE with the indication that a
right shift is to be performed. The value of the digit to be stored in
the first PE is determined by the MCU. In the terminology of Equations
2.1 through 2.3,

    ${}_jx_i = {}_jF_{i-1}$        i = 1, 2, ..., n        (2.12)

    ${}_jF_i = {}_{j-1}x_i$        i = 1, 2, ..., n        (2.13)

    ${}_jG_i$ = <null>             i = 1, 2, ..., n        (2.14)

where

${}_jF_0$        is the digit which the MCU transmits with the indication
                 that a right shift is to be performed. This value becomes
                 the value of the most significant digit of the shifted
                 operand.

The value of ${}_jF_n$, which is transmitted by PE$_n$ to the 'End Unit',
is stored as the new top element in the push-down stack. The push-down
stack is essentially an extended version of 'guard' digits.

A final note concerning shifts is that $\alpha_{LS} = 1$ and
$\alpha_{RS} = 0$. The exact format of the microinstructions LS and RS is
described in Section 4.3.2.2.
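
The right shift and the End Unit's push-down stack can be modelled in the same list-based style, with index 0 standing for the most significant PE. The function `right_shift` and the list `end_stack` are illustrative assumptions; the sketch only shows that each digit moves one PE to the right and that the digit leaving the last PE becomes the new top of the stack.

```python
from typing import List


def right_shift(digits: List[int], msd_from_mcu: int,
                end_stack: List[int]) -> List[int]:
    """One RS step: every PE stores the digit handed over by its left neighbor."""
    end_stack.append(digits[-1])           # digit leaving PE_n is pushed on the stack
    return [msd_from_mcu] + digits[:-1]    # PE_1 takes the digit chosen by the MCU


stack: List[int] = []
once = right_shift([3, 1, 4], 0, stack)
twice = right_shift(once, 0, stack)
print(once, twice, stack)
```

Because the stack keeps every digit shifted off the right end, later left shifts can recover them in reverse order, which is the sense in which the stack extends the idea of guard digits.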

2.6.3  The arithmetic microinstructions - The microinstructions in
this class are those instructions which perform some sort of arithmetic
transformation on the operands. These microinstructions operate on one,
two or more than two operands, depending on the nature of the
microinstruction. The various microinstructions in this class are: Form
Multiple and Add (FMA), Simple Sum (SS), Multiple Sum (MS), Assimilation
Recode (AR) and Normalization Recode (NR).

The microinstruction FMA is used to form the product of a multiplier
(quotient) digit and a multiplicand (divisor) digit and add (subtract)
it to (from) the partial product (partial remainder) in the execution of
a Multiplication (Division) instruction. It is the most complex of all
the microinstructions.

The microinstruction SS sums the contents of two registers and is used
to execute the Add or Subtract instructions. Although the
microinstruction FMA could be used for this purpose, a separate
microinstruction SS was designed for faster operation, especially because
addition or subtraction of two operands occurs much more frequently in a
computer program than multiplication or division.

The Multiple Sum microinstruction, MS, is used to add the contents of
more than two registers in a PE. This microinstruction is not
intrinsically necessary for the operation of the arithmetic unit but
rather comes about as a useful by-product of the design of the logic for
microinstruction FMA.

The microinstruction NR operates on a single operand in the Accumulator
register only. It is used to recode the operand into a form which, when
left-shifted one or more places, meets the normalization definition.

Finally, the Assimilation Recode microinstruction, AR, is used to convert
the operand in the Accumulator register from the number representation
used in the arithmetic processing to the conventional form for
communication to memory or other parts of the computer system. This
microinstruction is very similar to microinstruction NR.

All  the  above  microinstructions  are  discussed  in  detail  in 
Chapter  3. 

2.6.4 The memory-accessing microinstructions - These microinstructions
cause the exchange of data between the internal registers of the PEs and
a local buffer operand memory. They are used to fetch operands into PEs
for processing and to store the results for eventual transmission to the
main memory of the computer. The two microinstructions are Load from
Processor Memory (LPM) and Store into Processor Memory (SPM); the former
is used to bring operands into PE registers and the latter causes the
contents of a specified PE register to be stored into a specified
location of the local Operand Processor Memory. The microinstructions in
this class are similar to the inter-register transfer microinstructions
except that one of the source or destination addresses refers to some
location in the local Operand Processor Memory. In these
microinstructions also, the modifier ${}_jF_i$ is used to identify the
source and the destination. The exact format of these microinstructions
is discussed in Section 4.3.2.2. The communication of the PEs with the
Data Main Memory via the local Operand Processor Memory is discussed in
Chapter 5.

2.6.5 The miscellaneous microinstructions - One instruction in this
class is Load Constant (LDC). This microinstruction can be used to clear
an operand register by loading zeros into a specified register of all the
PEs in the arithmetic unit. It can also be used to initialize an operand
register spread across all the PEs to a pattern in which all the digits
are identical. An example of such a use is the loading of the maximum
value of operands. In the terminology of Section 2.4,

$$ {}_jX_i = {}_jF_{i-1}, \qquad i = 1, 2, \ldots, n \tag{2.15} $$

$$ {}_jF_i = {}_jF_{i-1}, \qquad i = 1, 2, \ldots, n \tag{2.16} $$

$$ {}_jG_i = \langle\text{null}\rangle, \qquad i = 1, 2, \ldots, n \tag{2.17} $$

${}_jF_0$ = digit which the MCU sends to PE$_1$ with the LDC
microinstruction and the register name in which the constant is to be
loaded.

Clearly, $\alpha_{LDC} = 0$, which means that no information is needed from
the right neighbor for the execution of this microinstruction. Note that
in Equation 2.15, only the digit part of field ${}_jF_{i-1}$ is stored in
register ${}_jX_i$.

3.  ARITHMETIC DESIGN AND IMPLEMENTATION CONSIDERATIONS

3.1  Introduction 

This chapter describes the arithmetic design of the Processing Element.
Arithmetic design consists of the choice of a suitable number system and
number representation, and the development of suitable digit-level
algorithms. Serial processing in an iterative structure has important
implications for all of these factors, which are considered in this
chapter. Implementation of the digit algorithm and its implications for
LSI realization of the Processing Element are also discussed.

3.2  Implications of Serial Processing on Arithmetic Design

From the description of processing in Section 2.4, it is evident that
the results are obtained on a digit-by-digit basis. To achieve a
compromise between digit-serial processing and arithmetic speed, the
arithmetic should be carried out in a higher radix, say $r = 2^k$ (k > 1),
such that k bits of the result are obtained at each step.

Since the processes of quotient generation, operand normalization,
mantissa overflow determination and the determination of the sign of the
result inherently require the examination of the most significant digits
of operands to determine what additional processing is necessary,
arithmetic algorithms should be so designed that the most significant
digits of the result are obtained first. The most-significant-digit-first
(MSDF) approach has the advantages of providing early status indication
(overflow, sign of the result, etc.), normalization concurrent with
processing, and early termination of processing as soon as enough
significant digits in the result have been obtained. The last of these
would allow faster variable precision arithmetic in a digit serial
environment. Early status indication would also aid an instruction
look-ahead unit. Further, the MSDF approach allows the meshing-in
(pipelining) of successive macroinstructions for efficient operation. For
example, if a MULTIPLY instruction is followed by a DIVIDE instruction,
at some point in time the least significant digits of the product can be
generated by a right directed procedure in the least significant elements
of the iterative structure, while the most significant elements are
generating quotient digits.

3.3  Choice of Number System

For a smooth flow of microinstructions in the linear iterative structure
and for maximizing the rate of computation, two constraints on
$\alpha_j$† are necessary:

a) The microinstructions should have regular data requirements
independent of the significance of the digits retained by a PE. That is,
$\alpha_j$ should be constant.

b) The value of $\alpha_j$ should be as small as possible because the
execution rate of a given microinstruction is inversely proportional to
$\alpha_j$.

† $\alpha_j$ is the number of PEs from which a given PE requires
information (in the logical sense) in order to execute the
microinstruction $\mu_j$.

In a conventional weighted number system, a carry or borrow into any
digital position is a function of all the digits to the right of this
position. Thus for MSDF algorithms, which are right directed, the
conventional number system cannot be employed, because in it the value of
$\alpha_j$ is a function of the significance of the digit itself. A
redundant number system which gives a bounded value of $\alpha_j$ is
clearly essential.
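
The dependence of a conventional carry on every digit to its right can be
seen in a small sketch. The helper `carry_into_msd` and the choice of
decimal digits are illustrative assumptions only.

```python
def carry_into_msd(a, b, r=10):
    """Carry into the most significant position when adding two
    conventional radix-r operands given as digit lists, most significant
    digit first."""
    carry = 0
    # Walk from the least significant digit up to the digit just to the
    # right of the most significant position.
    for x, y in zip(reversed(a[1:]), reversed(b[1:])):
        carry = (x + y + carry) // r
    return carry


# The two second operands differ only in their least significant digit,
# yet the carry into the leading position changes.
print(carry_into_msd([0, 9, 9, 9], [0, 0, 0, 1]))
print(carry_into_msd([0, 9, 9, 9], [0, 0, 0, 0]))
```

The leading digit of the sum therefore cannot be fixed before the last
digit has been examined, which is the situation a redundant
representation is meant to avoid.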


3.4   Choice  of  Number  Representation  and  Amount  of  Redundancy 

The  major  factors  influencing  the  choice  of  the  redundant  number 
representation  and  the  amount  of  redundancy  in  the  number  system  are 
the  following: 

a) the ease of conversion from the conventional number representation
to the redundant number representation,

b) its compatibility with the widely employed conventional binary
number system,

c) ease of normalization of operands to radix-2 limits, and

d) LSI technology constraints, namely

(i) minimization of the number of types of cells (in the arithmetic
and logic sense) required for higher radix ($r = 2^k$)
implementation of the digit processing logic, and

(ii) minimization of the number of input and output pins.

In this study, signed-digit redundant number representations with maximal
redundancy were chosen, because they satisfy most of these requirements.

3.4.1 Signed-digit number representations - Signed-Digit (SD)
representations are redundant positional representations.

A number X is represented, in radix-r, redundant, signed-digit format,
as a digit vector (abbreviated as "d-vector") of length n + m + 1

$$ X = x_{-m}\, x_{-(m-1)} \cdots x_0 \,.\, x_1\, x_2 \cdots x_{n-1}\, x_n $$

such that

$$ X = \sum_{i=-m}^{n} x_i\, r^{-i} $$

where

$$ x_i \in \{\bar{d}, \overline{(d-1)}, \ldots, \bar{1}, 0, 1, \ldots, (d-1), d\} $$

and

$$ \lceil r/2 \rceil \le d \le (r-1). $$

The overbar indicates negative values and, unless otherwise specified, we
shall use rightward indexing in the d-vector representation. For
maximally redundant signed-digit number systems

$$ d = r - 1. $$

That is, for a radix $r = 2^k$, each digit of the radix-r digit vector can
assume any integer value in the digit set
$\{\overline{(2^k-1)}, \ldots, \bar{1}, 0, 1, \ldots, (2^k-1)\}$.

Some of the desirable properties of signed-digit representations are:

1. Representation of zero is unique. The algebraic value X = 0 if, and
only if, all $x_i = 0$.

2. The additive inverse (negation) of an operand is very simply achieved
by reversing the sign of every non-zero digit individually.

3. The sign of the algebraic value of X is given by the sign of the
most-significant (leftmost) non-zero digit.

4. For the sum or difference of two signed-digit operands,
$\alpha_j = 1$.
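
The first three properties can be checked exhaustively for a small case.
The sketch below assumes a fractional d-vector .x_1 x_2 ... x_n with
maximal redundancy (digits from -(r-1) to r-1); the function names are
hypothetical.

```python
from fractions import Fraction
from itertools import product


def sd_value(digits, r):
    """Algebraic value of the d-vector .x_1 x_2 ... x_n (digits[0] is x_1)."""
    return sum(Fraction(x, r ** (i + 1)) for i, x in enumerate(digits))


def sd_negate(digits):
    """Additive inverse: reverse the sign of every digit individually."""
    return [-x for x in digits]


def sd_sign(digits):
    """Sign taken from the leftmost non-zero digit (0 if all digits are 0)."""
    for x in digits:
        if x != 0:
            return 1 if x > 0 else -1
    return 0


def check_properties(r, n):
    """Verify properties 1 to 3 for every d-vector of length n; return count."""
    count = 0
    for digits in product(range(-(r - 1), r), repeat=n):
        v = sd_value(digits, r)
        assert (v == 0) == all(x == 0 for x in digits)      # property 1
        assert sd_value(sd_negate(digits), r) == -v         # property 2
        assert (v > 0) - (v < 0) == sd_sign(digits)         # property 3
        count += 1
    return count


print(check_properties(r=4, n=3))
```

Exact rational arithmetic is used so that the comparison of values is not
affected by rounding.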

Maximal redundancy is compatible with the widely used sign-magnitude
representation of conventional binary input operands. A binary number
may be interpreted as a number of radix $r = 2^k$ by grouping the binary
digits into groups of k bits each. Conversion from the conventional
number system to signed-digit form is carried out simply by attaching the
sign of the conventional number to each digit. Another important
advantage is that the carry between bits of a digit has the same
properties as the carry between digits, whereas in representations that
are not maximally redundant such is not the case. This allows radix-2
arithmetic, for example shifts, if necessary. From the LSI viewpoint, it
allows a radix $r = 2^k$ arithmetic structure to be composed of k
identical and simpler radix-2 substructures interconnected in a regular
pattern. Maximal redundancy also provides more code-space patterns [36]
for testing the radix-r module. This makes the design of a self-testing
version of the module easier.
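
The conversion from a conventional sign-magnitude binary fraction to
signed-digit form amounts to grouping and sign attachment. A minimal
sketch follows; the function name and the bit-list input format (most
significant bit first) are assumptions made for illustration.

```python
def binary_to_signed_digits(sign, magnitude_bits, k):
    """Group the magnitude bits (most significant first) into radix-2**k
    digits and attach the operand sign to every digit.

    sign is 0 for a positive operand and 1 for a negative operand.
    """
    if len(magnitude_bits) % k != 0:
        raise ValueError("number of bits must be a multiple of k")
    factor = -1 if sign else 1
    digits = []
    for start in range(0, len(magnitude_bits), k):
        group = magnitude_bits[start:start + k]
        value = int("".join(str(b) for b in group), 2)
        digits.append(factor * value)
    return digits


# -(.101101) in binary becomes the radix-4 d-vector with digits -2, -3, -1.
print(binary_to_signed_digits(1, [1, 0, 1, 1, 0, 1], k=2))
```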

Two modes of representation for a signed digit of the radix-$2^k$
d-vector are used, depending on the area of application:

a) Sign-Magnitude (SM$_r$) Mode - Each radix-$2^k$ digit $x_i$ is
represented by a single sign bit $s_i$ and k magnitude bits $x_{i_j}$
(j = 0, 1, ..., k-1) such that

$$ x_i = (-1)^{s_i} \sum_{j=0}^{k-1} x_{i_j}\, 2^j, \qquad s_i,\ x_{i_j} \in \{0, 1\} $$

b) Redundant-Binary (RB$_r$) Mode - Each radix-$2^k$ digit $x_i$ is
represented by k redundant binary digits $x^*_{i_j}$
(j = 0, 1, ..., k-1), such that

$$ x_i = \sum_{j=0}^{k-1} x^*_{i_j}\, 2^j, \qquad x^*_{i_j} \in \{\bar{1}, 0, 1\}. $$

(Note that in the above representation of $x_i$ in terms of radix-2
sub-digits, we use zero-origin leftward indexing.)

The SM$_r$ mode requires k+1 binary storage elements (or k+1 pins as an
output from the processing element) and the RB$_r$ mode needs 2k binary
storage elements (or pins), because each redundant binary digit requires
two binary state elements. The SM$_r$ representation for a radix-r digit
is used for inter-PE communication to keep the number of external I/O
pins small. The RB$_r$ mode of representation is used for implementing
digit algorithms (as will be seen). If each redundant binary digit is
expressed in sign and magnitude form, conversion from SM$_r$ to RB$_r$
mode is trivial and involves appending the single sign bit to each of the
k magnitude bits. Conversion from RB$_r$ to SM$_r$ is less trivial,
however, and involves recognition that the sign of the radix-$2^k$ digit
is that of the most significant non-zero binary digit, followed by
subtraction of the magnitudes of those digits of opposite sign from the
magnitudes of those binary digits of the same sign.
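
The two conversions can be sketched as follows. This is an arithmetic
illustration only and says nothing about the logic realization; the
function names are hypothetical, and bit lists use zero-origin indexing
with element j carrying weight 2**j.

```python
from itertools import product


def sm_to_rb(sign, magnitude_bits):
    """Sign-magnitude to redundant binary: attach the sign to each bit."""
    factor = -1 if sign else 1
    return [factor * b for b in magnitude_bits]


def rb_to_sm(rb_digits):
    """Redundant binary to sign-magnitude for one radix-2**k digit."""
    sign = 0
    for d in reversed(rb_digits):          # most significant digit first
        if d != 0:
            sign = 1 if d < 0 else 0       # sign of leading non-zero digit
            break
    positive = sum(1 << j for j, d in enumerate(rb_digits) if d > 0)
    negative = sum(1 << j for j, d in enumerate(rb_digits) if d < 0)
    # Subtract the magnitudes of opposite sign from those of the same sign.
    magnitude = negative - positive if sign else positive - negative
    return sign, [(magnitude >> j) & 1 for j in range(len(rb_digits))]


def rb_value(rb_digits):
    return sum(d * (1 << j) for j, d in enumerate(rb_digits))


def sm_value(sign, magnitude_bits):
    magnitude = sum(b * (1 << j) for j, b in enumerate(magnitude_bits))
    return -magnitude if sign else magnitude


def check_round_trip(k):
    """Every redundant binary pattern must keep its value after conversion."""
    for rb in product((-1, 0, 1), repeat=k):
        sign, bits = rb_to_sm(list(rb))
        assert sm_value(sign, bits) == rb_value(rb)
    return True


print(rb_to_sm([1, -1, 1]), check_round_trip(3))
```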

3.4.2 Number format and range for mantissa - In this thesis, the
mantissa is assumed to be represented by a one-origin, right-indexed
d-vector of length n. The radix point is assumed to be at the left of the
most significant digit with index one. That is,

$$ X^j = .\, x^j_1\, x^j_2\, x^j_3 \cdots x^j_{n-1}\, x^j_n $$

For a conventional number representation, the values of digit $x_i$ are
{0, 1, ..., r-1} and for the signed-digit format, the digit $x_i$ can
assume any value in the digit set
$\{\overline{(r-1)}, \overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-2), (r-1)\}$.

When more than one operand is considered, a superscript is employed to
identify a specific operand, i.e., $X^1$ and $X^2$ for two operands or
$X^1, X^2, \ldots, X^j, \ldots, X^l$ for $l$ operands. The i-th digit of
$X^j$ is uniquely identified as $x^j_i$.

The algebraic value of the mantissa is given by

$$ X^j = \sum_{i=1}^{n} x^j_i\, r^{-i} $$

and $-1 < X^j < 1$.


3.5  Normalization  Considerations 

For the preparation of operands and the processing of results, it is
necessary to restrict the range of values which the mantissa may assume.
One usually restricts this range by requiring that all operands be
normalized. This is generally done by defining the form of the d-vector
representation of the restricted-range operands. However, in redundant
number representations there exist pseudo-normal forms, because more than
one d-vector representation is possible for the same algebraic value. For
example, the two numbers X and X'

$$ X = .\,0\,0 \cdots 0\,1 $$

$$ X' = .\,1\,\overline{(r-1)}\,\overline{(r-1)} \cdots \overline{(r-1)} $$

have the same algebraic value. The representation X' satisfies the
conventional normalization condition $x'_1 \ne 0$ but not the minimum
magnitude ($\ge 1/r$) requirement for its algebraic value.
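
A pseudo-normal form of this kind is easy to exhibit numerically. The
sketch below assumes radix 10 and four digits purely for readability, and
takes X as a single unit in the last place and X' as a leading 1 followed
by digits of value -(r-1).

```python
from fractions import Fraction


def sd_value(digits, r):
    """Algebraic value of the d-vector .x_1 x_2 ... x_n (digits[0] is x_1)."""
    return sum(Fraction(x, r ** (i + 1)) for i, x in enumerate(digits))


r, n = 10, 4
x = [0] * (n - 1) + [1]                  # .0001
x_prime = [1] + [-(r - 1)] * (n - 1)     # .1 followed by three digits -9

print(sd_value(x, r), sd_value(x_prime, r))
# Both d-vectors have the value r**-n, far below 1/r, although the
# leading digit of x_prime is non-zero.
```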

3.5.1 Definition and range of normalized numbers - Three alternative
definitions of normalized operands were considered.

Definition 1

A number X† (of nonzero algebraic value) is considered normalized when
its d-vector representation $X = .x_1 x_2 \ldots x_n$ satisfies either of
two conditions

a.  $|x_1| \ge 2$

b.  $|x_1| = 1$ and $x_1 \cdot x_2 \ge 0$

Definition 2

A number X (of nonzero algebraic value) is considered normalized when in
its d-vector representation $X = .x_1 x_2 \ldots x_{t-1} x_t \ldots x_n$
either

a.  $|x_1| \ge 2$

or

b.  $|x_1| = 1$, $x_2 = x_3 = \ldots = x_{t-1} = 0$ and

    $x_1 \cdot x_t > 0$,  $t \le n$

    $x_1 \cdot x_t = 0$,  $t = n$

where n is the length of the operand.

† In these definitions, the superscript on X has been dropped for ease
of readability.

Definition 3

A number X (of nonzero algebraic value) is considered normalized when
its d-vector representation $X = .x_1 x_2 x_3 \ldots x_j \ldots x_n$
satisfies the conditions

a.  $|x_1| > 0$

and   b.  $x_1 \cdot x_i \ge 0$

such that $2 \le i < j$ and $x_j$ is the first (counting from left) zero
digit in the d-vector of X. For example X = .11101 is considered
unnormalized per Definition 3.

The range of values for the normalized operands under Definitions 1 and 3
is

$$ \frac{r-1}{r^2} + \frac{1}{r^n} \le |X| \le 1 - \frac{1}{r^n} $$

and for operands normalized according to Definition 2, the range is

$$ \frac{1}{r} \le |X| \le 1 - \frac{1}{r^n} $$

Note that Definition 2 is equivalent to the conventional definition.
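
The quoted ranges can be checked by enumeration for a small radix and
length. The predicates below are one reading of the three definitions and
should be taken as assumptions: in particular, an operand containing no
zero digit is treated as normalized under Definition 3 when all its digits
share the sign of the leading digit. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def sd_value(d, r):
    return sum(Fraction(x, r ** (i + 1)) for i, x in enumerate(d))


def normalized_def1(d):
    return abs(d[0]) >= 2 or (abs(d[0]) == 1 and d[0] * d[1] >= 0)


def normalized_def2(d):
    if abs(d[0]) >= 2:
        return True
    if abs(d[0]) != 1:
        return False
    for x in d[1:]:
        if x != 0:                 # first non-zero digit after x_1
            return d[0] * x > 0
    return True                    # all following digits are zero


def normalized_def3(d):
    if d[0] == 0:
        return False
    for x in d[1:]:
        if x == 0:                 # first zero digit ends the check
            break
        if d[0] * x < 0:
            return False
    return True


def magnitude_range(predicate, r, n):
    """Smallest and largest |X| over all normalized d-vectors of length n."""
    mags = [abs(sd_value(d, r))
            for d in product(range(-(r - 1), r), repeat=n) if predicate(d)]
    return min(mags), max(mags)


r, n = 4, 3
for name, pred in (("1", normalized_def1), ("2", normalized_def2),
                   ("3", normalized_def3)):
    print("Definition", name, magnitude_range(pred, r, n))
```

For r = 4 and n = 3 the lower bound under Definitions 1 and 3 is
3/16 + 1/64, while under Definition 2 it is 1/4, in agreement with the
expressions above.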

Of the three definitions, Definition 3 was adopted. The factors
affecting the choice between the three definitions are:

i)   Complexity of normalization implementation,

ii)  Amount of significance loss, and

iii) Logic complexity of quotient selection.

For normalizing numbers according to Definition 2, one needs to examine
more digits than for Definition 1. If immediately following $|x_1| = 1$
there is a string of zeros of length v, the normalization procedure must
examine at least v+2 digits (to determine the sign of the first nonzero
digit following the string of zeros) in the case of Definition 2, whereas
only 2 digits need to be looked at for Definition 1. Since the
examination of digits is essentially a serial process, it takes v extra
steps for Definition 2.

When the results are normalized according to Definitions 1 and 3 there
may be a potential loss of one extra radix-r significant digit, compared
to Definition 2. Such a case can occur when a result d-vector is of the
form $1.0\,0 \ldots x_i\, x_{i+1} \ldots x_n$ and a post-normalization
shift becomes necessary. However, it is expected that such a case would
not occur very often, because for higher radix arithmetic the frequency
of zero digits is low, and also overflow occurs less often [37].

Finally, because of the redundant number representation, the quotient is
calculated based on a truncated version of the partial remainder and the
divisor. The number of digits of the truncated operands necessary for
quotient calculation depends on the minimum algebraic value of the
truncated divisor, say $D_{min}$. The lower the value of $D_{min}$, the
greater is the number of digits of the truncated divisor and partial
remainder necessary for the quotient calculation. For higher values of
radix r, e.g., $r \ge 16$, the difference in the minimum value of the
truncated, normalized divisor for Definitions 1 and 2 is very small and
the number of digits required for quotient calculation remains the same.
However, for lower radices ($8 \ge r \ge 2$), the number of digits
required, and thus the logic complexity for quotient calculation, is
greater for Definitions 1 and 3 than for Definition 2.


In the case of Definition 2, this number would have to be normalized
further to the form $.(r-1)(r-1) \ldots (x_i - 1)\, x_{i+1} \ldots x_n$
and a post-normalization shift would not be necessary.

From the above discussion of factors affecting the choice of definition
of normalized numbers for maximally redundant signed-digit operands, it
is clear that any of the three choices would be almost equally useful
for higher radices (r > 16). But for r = 2, 4, where the probability of
a string of zeros is higher, Definition 1 or 3 would definitely be better
for faster normalization, although the logic complexity of quotient
calculation would correspondingly be increased. The speed of quotient
calculation, and thus the speed of the DIVIDE instruction, would be
decreased. But the frequency of DIVIDE instructions is rather low
compared to ADD instructions, and so Definition 1 or 3 would overall add
to the speed of the arithmetic processing.

For  the  present  research,  Definition  3  was  chosen  because  of  its 
compatibility  with  the  Assimilation  Recode  (AR)  microinstruction's  digit 
algorithm.   The  Assimilation  Recode  algorithm  converts  a  signed-digit 
operand  into  a  conventional  sign-magnitude  operand.   This  compatibility 
allows  the  sharing  of  logic  in  the  implementation  of  Normalize  Recode  (NR) 
microinstruction  and  microinstruction  AR  and  thus  reduces  the  control 
complexity  of  a  PE.   Digit  algorithms  for  microinstructions  NR  and  AR  are 
discussed  in  Sections  3.6.4  and  3.6.5. 

Normalization of an operand is achieved by shifting out leading zeros,
followed by a 'Normalize Recode' microinstruction, again followed by
shifting out leading zeros, if any. This is discussed further in
Section 6.2.6.

3.6  Arithmetic Microinstructions and Corresponding Digit Algorithms

In the present research, the design of the Processing Element is
restricted to the capability of performing the four basic arithmetic
processes of Addition, Subtraction, Multiplication and Division of two
operands. Multiplication and Division are implemented as a number of
additions or subtractions (of a multiple of the multiplicand or divisor)
and shifts, as in a classical von Neumann arithmetic unit. Hence the
basic arithmetic microinstruction necessary is of the form

$$ X^w = X^u + (x^q_j * X^v) \tag{3.6.1} $$

where $X^w$, $X^u$ and $X^v$ are d-vectors and $x^q_j$ is a digit. In the
case of multiplication (division), $X^v$ is the multiplicand (divisor),
$X^u$ is the old partial product (partial remainder), $X^w$ is the new
partial product (partial remainder) and $x^q_j$ is the signed multiplier
(quotient) digit.

The microinstruction which achieves (3.6.1) is termed 'Form Multiple and
Add' (FMA).

Other microinstructions of an arithmetical nature which are needed for
the execution of the four basic arithmetic processes are Simple Sum (SS),
Multiple Sum (MS), Normalize Recode (NR), and Assimilation Recode (AR).
The function of each of these microinstructions and the corresponding
digit algorithm for execution of the microinstruction in a processing
element is discussed next.

3.6.1 Simple sum (SS) microinstruction - This microinstruction forms
the sum of two signed-digit operands, say A and $\Phi$, such that

$$ A' = A + \Phi $$

where A' is the new value of the operand A. In general, A and A' are in
the Accumulator register of the Processing Elements.

At the digit level, the SS microinstruction is characterized by

$$ a'_i = a_i + \phi_i + T_i - r\,T_{i-1} $$

where $a_i$, $a'_i$ and $\phi_i$ are radix-r signed digits of the operands
in the active registers of PE$_i$, $T_i$ is the 'Transfer' (carry-borrow)
from the adjacent processing element PE$_{i+1}$, and $T_{i-1}$ is the
'Transfer' out of PE$_i$.
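
The digit-level relation can be exercised with a small sketch. The
transfer selection rule used below is the standard textbook rule for
maximally redundant signed-digit addition with r > 2; it is an assumption
adopted for illustration and is not the RBA-2 cascade by which the sum is
realized in this design. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def sd_value(d, r):
    return sum(Fraction(x, r ** (i + 1)) for i, x in enumerate(d))


def sd_add(a, phi, r):
    """Add two radix-r signed-digit fractions (most significant digit
    first, digits in -(r-1)..(r-1), r > 2).

    Returns (t0, digits), where t0 is the transfer leaving the most
    significant position and has weight one.
    """
    n = len(a)
    transfer = [0] * n      # transfer[i]: sent by position i to its left
    interim = [0] * n
    for i in range(n):
        s = a[i] + phi[i]
        if s >= r - 1:
            t = 1
        elif s <= -(r - 1):
            t = -1
        else:
            t = 0
        transfer[i] = t
        interim[i] = s - r * t          # magnitude at most r - 2
    digits = []
    for i in range(n):
        t_in = transfer[i + 1] if i + 1 < n else 0
        digits.append(interim[i] + t_in)   # a' = a + phi + T_in - r*T_out
    return transfer[0], digits


def check_exhaustive(r, n):
    """Check value and digit range for every pair of operands."""
    rng = range(-(r - 1), r)
    for a in product(rng, repeat=n):
        for phi in product(rng, repeat=n):
            t0, s = sd_add(a, phi, r)
            assert all(abs(x) <= r - 1 for x in s)
            assert t0 + sd_value(s, r) == sd_value(a, r) + sd_value(phi, r)
    return True


print(sd_add([3, -2, 1], [2, 3, -3], 4), check_exhaustive(3, 3))
```

Each result digit depends only on the operand digits of its own position
and of the position immediately to its right, so the transfer never
ripples further.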

3.6.1.1 Digit algorithm - The specification of the digit algorithm for
SS is intimately connected with its implementation and is described below
in terms of its algebraic implementation.

Because of the structural regularity requirements of the LSI technology,
the sum of two radix-r signed digits $a_i$ and $\phi_i$ is realized in a
linear cascade of k two-input redundant binary adders.

This is schematically shown in Figure 3.1. RBA-2 is a two-input
redundant binary adder which accepts two redundant binary digits
$a^*_{i_v}, \phi^*_{i_v} \in \{\bar{1}, 0, 1\}$ and produces one redundant
binary digit. The design of such an RBA-2 was studied in detail by
Borovec [38] and we shall interchangeably use the term Borovec Unit (BU)
for RBA-2.

3.6.1.1.1 Arithmetic design of RBA-2 - The major consideration in the
design of RBA-2 was the minimization of the number of pins required for
the 'Transfer' into and out of an RBA-2. One such design is shown in
Figure 3.2. RBA-2 is realized by a series of four arithmetic
transformations as follows.

[Figure 3.2  Arithmetic Structure of an RBA-2]

$$ \sigma_1 :\quad \phi^*_{i_v} = \phi^+_{i_v} + \phi^-_{i_v} $$

$$ \sigma_2 :\quad a^*_{i_v} + \phi^-_{i_v} = w'_{i_v} + 2\,t^-_{i_{v-1}} $$

$$ \sigma_3 :\quad w'_{i_v} + \phi^+_{i_v} + t^-_{i_v} = w''_{i_v} + 2\,t^+_{i_{v-1}} $$

$$ \sigma_4 :\quad w''_{i_v} + t^+_{i_v} = a'_{i_v} $$

where $a^*_{i_v}$, $\phi^*_{i_v}$ and $a'_{i_v} \in \{\bar{1}, 0, 1\}$,

$t^-_{i_v}$, $t^-_{i_{v-1}}$, $\phi^-_{i_v} \in \{0, \bar{1}\}$

and $t^+_{i_v}$, $t^+_{i_{v-1}}$, $\phi^+_{i_v} \in \{0, 1\}$.

Let us call $t^-_{i_v}$, $t^-_{i_{v-1}}$ 'negative transfers' and
$t^+_{i_v}$, $t^+_{i_{v-1}}$ 'positive transfers'. The logic design of
RBA-2 is discussed in Section 4.2.2.3.

It is clear from the design of RBA-2 above that the transfer digits
$t^-_{i_v}$ and $t^+_{i_v}$ are respectively dependent on the inputs to
the RBA-2s which are immediately adjacent and one next to it.
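
A cascade of such adders can be simulated arithmetically. The sketch
below is a reading of the four-step structure under stated assumptions:
the second operand digit is split into a positive and a negative part, a
negative transfer is formed first, then a positive transfer, and the
result digit absorbs the incoming positive transfer. The intermediate
digit sets (0 or 1 after the second step, -1 or 0 after the third) are
inferred from the transfer digit sets and are not taken from the text;
the list orientation (most significant digit first) and all names are
choices made here.

```python
from itertools import product


def rba2_cascade_add(a, phi):
    """Add two redundant binary numbers (digits in {-1, 0, 1}, most
    significant first). Returns (t_neg_out, t_pos_out, digits), where the
    two transfers leave the most significant position."""
    k = len(a)
    phi_neg = [min(p, 0) for p in phi]      # step 1: split phi
    phi_pos = [max(p, 0) for p in phi]
    w1, t_neg = [], []
    for v in range(k):                      # step 2: negative transfer
        s = a[v] + phi_neg[v]               # in -2..1
        t = -1 if s < 0 else 0
        t_neg.append(t)
        w1.append(s - 2 * t)                # 0 or 1
    w2, t_pos = [], []
    for v in range(k):                      # step 3: positive transfer
        t_in = t_neg[v + 1] if v + 1 < k else 0
        s = w1[v] + phi_pos[v] + t_in       # in -1..2
        t = 1 if s > 0 else 0
        t_pos.append(t)
        w2.append(s - 2 * t)                # -1 or 0
    digits = []
    for v in range(k):                      # step 4: final digit
        t_in = t_pos[v + 1] if v + 1 < k else 0
        digits.append(w2[v] + t_in)
    return t_neg[0], t_pos[0], digits


def rb_value(digits):
    k = len(digits)
    return sum(d * (1 << (k - 1 - v)) for v, d in enumerate(digits))


def check_all(k):
    """Exhaustively confirm value preservation and the output digit set."""
    for a in product((-1, 0, 1), repeat=k):
        for phi in product((-1, 0, 1), repeat=k):
            t_neg, t_pos, s = rba2_cascade_add(a, phi)
            assert all(d in (-1, 0, 1) for d in s)
            total = rb_value(s) + (t_neg + t_pos) * (1 << k)
            assert total == rb_value(a) + rb_value(phi)
    return True


print(rba2_cascade_add([1, 1], [1, 1]), check_all(3))
```

In this model the negative transfer depends only on the inputs of the
neighbouring adder, while the positive transfer also depends on the adder
one position further, which is the two-position dependence noted above.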

In terms of the notation of Section 2.4, we have for the SS
microinstruction

$$ {}_{SS}F_i = {}_{SS}F_{i-1} \qquad \text{for all } i,\ 1 \le i \le n $$

$$ {}_{SS}F_0 = \langle\text{null}\rangle $$

$$ {}_{SS}G_{i+1} = \{t^-_i,\ t^+_i\} $$

and $\alpha_{SS} = 2$ for $r = 2$, $\alpha_{SS} = 1$ for $r > 2$.

3.6.2 Form multiple and add (FMA) microinstruction - This
microinstruction is used to form the product of the multiplicand
(divisor) d-vector and a multiplier (quotient) digit, which when added to
the old partial product (partial remainder) gives the new partial product
(partial remainder) in the execution of a Multiplication (Division) of
two d-vector operands.

At the digit level, this microinstruction is characterized by the
arithmetic transfer function

$$ a'_i = a_i + m_j \cdot \phi_i - r\,T_{i-1} + T_i \tag{3.6.3} $$

where

$a_i$, $\phi_i$  are the digits in the active operand registers of the
                 processing element PE$_i$
$a'_i$           is the new value of digit $a_i$
$m_j$            is a multiplier (quotient) digit
$r$              is the radix
and $T_i$ ($T_{i-1}$) is the 'Transfer' (carry/borrow) from (to) adjacent
processing element PE$_{i+1}$ (PE$_{i-1}$).

This is functionally represented in Figure 3.3.

3.6.2.1 Digit algorithm - The major considerations in the design of the
digit algorithm for microinstruction FMA were the LSI technology
constraints, namely, that the implementation logic for FMA should consist
of a regular and repetitive structure. The specification of the digit
algorithm is intimately connected with the implementation procedure and
is described below as such.

Figure 3.3  Functional Representation of Microinstruction FMA.

The transfer function in Equation (3.6.3) is achieved by a series of two
transformations $f_1$ and $f_2$ as shown in Figure 3.4. The two
transformations are


$$ f_1 :\quad m_j \cdot \phi_i = r\,t^{p}_{i-1} + w_i \tag{3.6.4} $$

and

$$ f_2 :\quad w_i + a_i + t^{p}_i + t^{s}_i = a'_i + r\,t^{s}_{i-1}. \tag{3.6.5} $$


Transformation $f_2$ essentially requires a radix-r multi-input adder
which forms the sum of digits of both signs. This multi-input adder is
implemented as a k-stage linear cascade of radix-2 multi-input adders,
where each input of a radix-2 adder can assume the three values
$\bar{1}$, 0, 1. The input digits $w_i$, $a_i$, $t^p_i$ are expressed in
the form of radix-2 d-vectors such that each component of the radix-2
d-vector is from the redundant binary digit set $\{\bar{1}, 0, 1\}$. This
is schematically shown in Figure 3.5. MIRBA represents the Multi-Input
Redundant Binary Adder. The number of redundant binary inputs to each
MIRBA is determined as follows. Two algorithms were studied for the
implementation of transformation $f_1$ given in Equation (3.6.4). They
differ in the maximum values that $w_i$ and $t^p_{i-1}$ can assume.

3.6.2.1.1 Algorithm 1 - To illustrate the principle, let

$$ \phi_i = \sum_{\ell=0}^{k-1} \phi^*_{i_\ell}\, 2^{\ell}, \qquad
   m_j = \sum_{q=0}^{k-1} m^*_{j_q}\, 2^{q}, \qquad
   \phi^*_{i_\ell},\ m^*_{j_q} \in \{\bar{1}, 0, 1\}. $$

Figure 3.4  Functional Representation of the Digit Algorithm for FMA.

The product $\phi_i \cdot m_j$ is implemented by a product matrix
generator which consists of a k x k square array of redundant binary
product cells. Each cell forms the product of two redundant binary
digits $\phi^*_{i_\ell}$ and $m^*_{j_q}$, and its output product digit
$p^*_{\ell q}$ is also in the digit set $\{\bar{1}, 0, 1\}$.

The product may be viewed in terms of the sums of the $p^*$ terms of the
same weight in the product matrix:

$$ \phi_i \cdot m_j = \sum_{v=0}^{2k-2} 2^{v}
   \left( \sum_{\ell=0}^{v} p^*_{\ell,\,v-\ell} \right) \tag{3.6.6} $$

where $p^*_{\ell,\,v-\ell} = \phi^*_{i_\ell} \cdot m^*_{j_{v-\ell}}$

and $p^*_{\ell,\,v-\ell}$ does not exist when either $\ell > k-1$ or
$v - \ell > k-1$.

The number $N_v$ of product elements in the v-th column of the product
matrix is given by

$$N_v = \begin{cases} v + 1, & 0 \le v \le k-1 \\ -v + (2k-1), & k \le v \le 2k-2 \end{cases} \tag{3.6.7}$$

The number $N_v$ is maximum in the column of weight $2^{k-1}$ and is equal to k.  The
number of product elements in the other columns decreases uniformly by one on
either side of this column, as shown in Figure 3.6.
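As a quick check of the column counts, the formula can be compared with a direct enumeration of the cells of a k x k array. This is only an illustrative sketch; the function names are ours.

```python
def column_count_formula(v, k):
    """N_v from Equation (3.6.7)."""
    if 0 <= v <= k - 1:
        return v + 1
    if k <= v <= 2 * k - 2:
        return -v + (2 * k - 1)
    raise ValueError("column index out of range")


def column_count_by_enumeration(v, k):
    """Count the cells (l, q) of the k x k product matrix with l + q = v."""
    return sum(1 for l in range(k) for q in range(k) if l + q == v)
```

For k = 4 both functions give the counts 1, 2, 3, 4, 3, 2, 1 for v = 0, ..., 6, with the maximum k in the column of weight $2^{k-1}$.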


Figure 3.6  (product matrix of $\phi_i \cdot m_j$; drawing and caption not recoverable)


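Equation (3.6.6) can be rewritten in the following form

$$\phi_i \cdot m_j = \sum_{v=0}^{k-1} 2^v \left( \sum_{\ell=0}^{v} p^*_{\ell,v-\ell} \right) + \sum_{v=k}^{2k-2} 2^v \left( \sum_{\ell=0}^{v} p^*_{\ell,v-\ell} \right)$$

$$\phantom{\phi_i \cdot m_j} = \sum_{v=0}^{k-1} 2^v \left( \sum_{\ell=0}^{v} p^*_{\ell,v-\ell} \right) + 2^k \sum_{v=k}^{2k-2} 2^{v-k} \left( \sum_{\ell=0}^{v} p^*_{\ell,v-\ell} \right) \tag{3.6.8}$$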

The columns of weight $2^v$ ($k \le v \le 2k-2$) of the product matrix can be
considered as forming a carry $t_{i-1}^p$, called the Collective Product Transfer (CPT),
to the next more significant radix-$2^k$ digital position i-1.  These
($k \le v \le 2k-2$) CPT columns have weights $2^{v-k}$ (Equation (3.6.8)) with respect to
the higher significant digital position.  When similar CPT columns from
digital position i+1 are added in the appropriate (of the same weight)
MIRBA of the digital position i, all the stages of the linear cascade of
MIRBAs in PE_i become identical, each MIRBA having k inputs.  This is
illustrated in Figure 3.7.

Further, the transformation $f_2$ requires the addition of one radix-r
digit $a_i$ to $w_i$ and $t_i^p$, and the digit $a_i$ contributes one redundant binary
input to each position of MIRBA.  Hence the transformation $f_2$ requires k
MIRBAs, each capable of summing k+1 redundant binary inputs, as well as
the 'Transfer' from the adjacent MIRBA position.  Figure 3.8 schematically
shows the implementation of the FMA digit algorithm for radix 16, that is,
k=4.
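The claim that every stage of the cascade receives the same number of product inputs can be verified by counting. The sketch below assumes the column counts of Equation (3.6.7) and pairs each column of weight $2^s$ in PE_i with the CPT column of weight $2^{s+k}$ arriving from PE_{i+1}; the function name is ours.

```python
def product_inputs_per_stage(k):
    """Product-matrix inputs seen by each of the k MIRBA stages of one PE.

    Stage s (weight 2**s) receives its own column s plus the CPT column
    s + k of the next less significant digit position, if that column exists.
    """
    def n(v):
        # Column population N_v of a k x k product matrix.
        return v + 1 if v <= k - 1 else 2 * k - 1 - v

    counts = []
    for s in range(k):
        own = n(s)
        cpt = n(s + k) if s + k <= 2 * k - 2 else 0
        counts.append(own + cpt)
    return counts
```

For k = 4 this gives `[4, 4, 4, 4]`: each stage sees k product inputs, and the accumulator digit adds one more, giving the (k+1)-input MIRBA described above.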

Figure 3.7  (identical k-input MIRBA stages formed by adding the CPT columns; drawing and caption not recoverable)

Figure 3.8  (implementation of the FMA digit algorithm, Algorithm 1, for radix 16; drawing not recoverable)

Values of $|w_i|_{max}$ and $|t_{i-1}^p|_{max}$

From Equations (3.6.4) and (3.6.8), we have

$$w_i = \sum_{v=0}^{k-1} 2^v \left( \sum_{\ell=0}^{v} p^*_{\ell,v-\ell} \right) = \sum_{v=0}^{k-1} 2^v \left( \sum_{\ell=0}^{v} \phi^*_{i\ell} \cdot m^*_{j,v-\ell} \right) \tag{3.6.9}$$

$$t_{i-1}^p = \sum_{v=k}^{2k-2} 2^{v-k} \left( \sum_{\ell=0}^{v} p^*_{\ell,v-\ell} \right) = \sum_{v=k}^{2k-2} 2^{v-k} \left( \sum_{\ell=0}^{v} \phi^*_{i\ell} \cdot m^*_{j,v-\ell} \right) \tag{3.6.10}$$


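From Equations (3.6.9) and (3.6.7), we have

$$|w_i|_{max} = 1 \cdot 2^0 + 2 \cdot 2^1 + \cdots + (v+1) \cdot 2^v + \cdots + k \cdot 2^{k-1} = 2^k (k-1) + 1 \tag{3.6.11}$$

Further

$$|w_i|_{max} + 2^k\, |t_{i-1}^p|_{max} = (2^k - 1)^2$$

$$\therefore\ |t_{i-1}^p|_{max} = \frac{(2^k - 1)^2 - \left( 2^k (k-1) + 1 \right)}{2^k} = 2^k - (k+1) \tag{3.6.12}$$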
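These closed forms are easy to confirm by summing the columns directly. The sketch below assumes the worst case in which every cell of the product matrix is 1 (both digits equal to $2^k - 1$), so that each column contributes its full population $N_v$; the function names are ours.

```python
def w_max(k):
    """Largest |w_i|: columns v = 0 .. k-1, each holding v + 1 ones, weighted by 2**v."""
    return sum((v + 1) * 2 ** v for v in range(k))


def t_max(k):
    """Largest |t^p_{i-1}|: columns v = k .. 2k-2, each holding 2k-1-v ones,
    weighted by 2**(v-k) relative to the next more significant digit."""
    return sum((2 * k - 1 - v) * 2 ** (v - k) for v in range(k, 2 * k - 1))
```

For radix 16 (k = 4) this gives `w_max(4) == 49` and `t_max(4) == 11`, and 49 + 16 * 11 = 225 = 15 * 15, the largest digit product.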


In summary, the digit algorithm 1 for the microinstruction FMA can be
described as follows:

i)    Perform transformation $f_1$ by recoding the product of digits $\phi_i$
and $m_j$ into $w_i$, $t_{i-1}^p$ such that $w_i$ and $t_{i-1}^p$ are given by Equations (3.6.9)
and (3.6.10) respectively.

ii)   Perform transformation $f_2$ in a k-stage linear cascade of (k+1)-input
redundant binary adders.

The design of the multi-input redundant binary adder is discussed in
Section 3.6.2.1.3.

3.6.2.1.2  Algorithm 2 - In this algorithm, the transformation $f_1$
recodes the product $\phi_i \cdot m_j$ into digits $w_i$ and $t_{i-1}^p$ such that

$$w_i \in \{\overline{(r-1)}, \overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-2), (r-1)\}$$

and

$$t_{i-1}^p \in \{\overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-2)\}.$$

Clearly, the recoded digits $w_i$ and $t_i^p$ each contribute only one redundant binary
input to each MIRBA of the linear cascade.

Then the transformation $f_2$ is performed in the k-stage linear cascade
of 3-input MIRBAs.  This is illustrated in Figure 3.9.

Note that in Algorithm 2 the number of inputs to the MIRBAs is
always three, independent of the value of k.

The LSI implications of Algorithms 1 and 2 are discussed later in
Section 4.2.2.5.
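The digit ranges quoted for Algorithm 2 can be illustrated with a simple recoding of the digit product. The sketch below assumes a sign-magnitude split of the product into quotient and remainder with respect to the radix; the thesis does not state the recoding rule at this point, so this rule and the function name are assumptions made only for illustration.

```python
def recode_product(phi, m, r):
    """Recode phi * m as (w, t) with phi * m == w + r * t.

    Assumption: w and t both carry the sign of the product (truncation
    toward zero), which keeps |w| <= r - 1 and |t| <= r - 2.
    """
    product = phi * m
    sign = -1 if product < 0 else 1
    t, w = divmod(abs(product), r)
    return sign * w, sign * t
```

With r = 16, the extreme product 15 * 15 = 225 recodes to w = 1 and t = 14, since $(r-1)^2 = r(r-2) + 1$; this is why the transfer digit never exceeds r-2 in magnitude.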

3.6.2.1.3  Design of a multi-input redundant binary adder (MIRBA) - A
MIRBA is a limited carry/borrow propagation adder which accepts several
redundant binary inputs (digit set $\{\bar{1},0,1\}$) and produces one redundant
binary output (with appropriate adder 'Transfers' for more significant
adjacent adder stages).

Definition

Let us define a new parameter $a^b$.  The redundant binary output of
any MIRBA is dependent on the 'Transfers' (the composite term for carry/
borrow) input to that MIRBA.  In a redundant number system, the 'Transfers'
are functions of 'primary' inputs (other than 'Transfer' inputs) to only
a limited number of adjacent less significant MIRBAs.  $a^b$ denotes the
number of such adjacent MIRBAs whose 'primary' inputs, in cooperation with
the primary inputs of a given MIRBA, determine the output of that MIRBA.

Figure 3.9  (implementation of Algorithm 2 with 3-input MIRBAs; drawing and caption not recoverable)

The radix-$2^k$ digit processing logic in, say, PE_i consists of a k-stage
linear cascade of (k+1)-input MIRBAs.  Except for the most significant
MIRBA in the k-stage cascade, the primary inputs to the MIRBAs in PE_i are
functions of radix-$2^k$ operand digits in PE_i and PE_{i+1} (accumulator
digits $a_i$, multiplier digit $m_j$ and multiplicand digits $\phi_i$, $\phi_{i+1}$).  Thus
$a_j$ is related to $a^b$ by Equation (3.6.13)

$$a_j = \left\lceil \frac{a^b - 1}{k} \right\rceil + 1 \tag{3.6.13}$$


3.6.2.1.3.1  Rohatsch's  [39]  technique  -  This  is  a  deterministic  and 
explicit  transformation  procedure  which  converts  a  given  input  digit  set 
into  the  required  output  digit  set  by  a  series  of  simple  transformations. 

In  using  this  technique,  one  generally  proceeds  backwards;  namely, 
consider  the  transformations  going  from  output  set  to  input  set.   The 
basic  concept  of  Rohatsch's  technique  is  very  simple: 

i)    Take the desired output set S, find two or more sets $A_0$, $A_1$,
$A_2$, ..., $A_n$ such that

$$S = A_n + A_{n-1} + \cdots + A_2 + A_1 + A_0.$$

ii)   Form the input set M where

$$M = r^n A_n + r^{n-1} A_{n-1} + \cdots + r^1 A_1 + A_0$$

where r is the radix of the adder.  In our case, for MIRBA, r=2.

iii)  If necessary, repeat the steps i) and ii) (using the last input
set as the new output set) as many times as is required to
generate a set which includes the desired input set.

Steps i) and ii) above together constitute an n-th order Simple Transformation
(referred to as S.T.).  For the contiguity of sets M and S, $A_n$, $A_{n-1}$,
..., $A_0$ must be contiguous and the number of distinct digits in sets $A_i$,
$n-1 \ge i \ge 0$, should be greater than or equal to r.
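Steps i) and ii) are plain set arithmetic and can be sketched directly. In the sketch below the component sets are illustrative choices of ours, not taken from the thesis or from Rohatsch's work; the sum of sets is taken element by element over all combinations.

```python
from itertools import product as cartesian


def weighted_set_sum(component_sets, weights):
    """All values sum(w_i * a_i) with one element a_i drawn from each set A_i."""
    return {
        sum(w * a for w, a in zip(weights, combo))
        for combo in cartesian(*component_sets)
    }


def simple_transformation(component_sets, r=2):
    """Given [A_0, A_1, ..., A_n], return (S, M) for an n-th order S.T.

    S = A_n + ... + A_0 is the output set of step i);
    M = r**n A_n + ... + r A_1 + A_0 is the input set of step ii).
    """
    n = len(component_sets) - 1
    output_set = weighted_set_sum(component_sets, [1] * (n + 1))
    input_set = weighted_set_sum(component_sets, [r ** i for i in range(n + 1)])
    return output_set, input_set


def is_contiguous(values):
    """True when the set holds every integer between its minimum and maximum."""
    return values == set(range(min(values), max(values) + 1))
```

For example, with the hypothetical first order choice $A_0 = \{0,1\}$, $A_1 = \{\bar{1},0\}$ and r = 2, the output set is $\{\bar{1},0,1\}$ and the input set is $\{\bar{2},\bar{1},0,1\}$, both contiguous. Repeating the construction with a new decomposition of the input set, as in step iii), widens the admissible input range further.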

Using the above approach, we find that for $k > 2$, a (k+1)-input
MIRBA requires a series of three S.T.s.  Figure 3.10a shows one such
four-level adder (each level indicated by a box) which is applicable for
$k \le 5$.  In this adder, level 1 and level 2 perform first order S.T.s whereas
level 3 represents a 2nd order transformation.  If level 3 performs a
third order or fourth order transformation, such a four-level adder
would be applicable for $k \le 9$ and $k \le 11$ respectively.

It is interesting to note that if level 2 achieves a 2nd order S.T.
and level 3 constitutes a 6th order S.T., then the four-level adder can
be used to sum as many as 51 redundant binary $\{\bar{1},0,1\}$ inputs.  This is
shown in Figure 3.10b.

However, the logic design of the bottom two levels is highly complicated
for $k \ge 5$ if they are to be implemented in two or three logic
levels.  In practice, the technique is to break down the bottom level
structure into equivalent simpler structures, frequently at the cost of
increasing the number of levels, as shown in Figure 3.10c for k = 5.

In this adder structure, $a^b$ is given by

$$a^b = \sum_{v=1}^{q-1} n_v$$

where  q    =  number of levels in the MIRBA
       n_v  =  order of the S.T. performed by adder level v.

Figure 3.10a  Illustration of the Algebraic Design
of a MIRBA, using First Order Simple
Transformations only.
Note:  Entries in the box show the allowed output digit set values.

Figure 3.10b  Illustration of the Algebraic Design
of a MIRBA using Higher (>2) Order
Simple Transformation.
Note:  Entries in the box show the allowed output digit set values.

Figure 3.10c  Algebraic Design of Bottom Level
(Level 4) Box of Figure 3.10a, built from 3-input,
2-output redundant binary adders.


Table 3.1 shows the values of $a^b$ and $a_j$ for various values of k,
for a (k+1)-input MIRBA.


Table 3.1
Values of $a^b$ and $a_j$ for Various (k+1)-Input MIRBA Configurations

  radix            Rohatsch's      log-sum        RBA-3, RBA-2
  r = 2^k    k     Technique       tree           tree structure
                   a^b    a_j      a^b    a_j     a^b    a_j
  -------   ---    ---    ---      ---    ---     ---    ---
      4      2      3      2        4      3       3      2
      8      3      4      2        4      2       5      3
     16      4      4      2        6      3       5      2
     32      5      4      2        6      2       5      2
     64      6      5      2        6      2       6      2
    128      7      5      2        6      2       6      2
    256      8      5      2        8      2       6      2

3.6.2.1.3.2  Log-sum tree technique - A conceptually simple approach
is to realize the (k+1)-input MIRBA by a log-sum tree structure of two-input
redundant binary adders (RBA-2).  For a (k+1)-input MIRBA, the tree
structure has t levels of Borovec Units (BUs) such that

$$t = \lceil \log_2 (k+1) \rceil$$

and the number of BUs required is k.  Figure 3.11 shows the log-sum tree
structure for a five-input MIRBA.  In this configuration,

$$a^b = 2t = 2 \lceil \log_2 (k+1) \rceil \quad \text{and} \quad a_j = \left\lceil \frac{2 \lceil \log_2 (k+1) \rceil - 1}{k} \right\rceil + 1.$$

The value of $a_j$ for various values of k is tabulated in Table 3.1.  From
the table we find that for k=2 and k=4, that is, radices 4 and 16,
$a_j = 3$, whereas $a_j = 2$ for all other values of k.  Since the minimum
value of $a_j$ is desirable, a different arrangement of BUs, described in the
third approach given next, can be used to achieve $a_j = 2$.
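The log-sum tree figures can be reproduced with a few lines of arithmetic. The sketch below assumes a pairwise reduction of the k+1 inputs and reads the digit-level relation as $a_j = \lceil (a^b - 1)/k \rceil + 1$, which is our reading of the damaged Equation (3.6.13); it agrees with the tabulated log-sum values for k = 2 to 8. The function names are ours.

```python
def log_sum_tree(k):
    """Pairwise reduction of k + 1 inputs with two-input adders.

    Returns (levels, adders): the depth t of the tree and the number of
    two-input adders it contains.
    """
    remaining = k + 1
    levels = 0
    adders = 0
    while remaining > 1:
        pairs = remaining // 2
        adders += pairs
        remaining -= pairs  # each pair collapses into a single partial sum
        levels += 1
    return levels, adders


def digit_span(a_b, k):
    """a_j = ceil((a_b - 1) / k) + 1, using integer arithmetic only."""
    return -(-(a_b - 1) // k) + 1


def log_sum_row(k):
    """(a_b, a_j, adders) for the log-sum tree of a (k+1)-input MIRBA."""
    levels, adders = log_sum_tree(k)
    a_b = 2 * levels
    return a_b, digit_span(a_b, k), adders
```

For k = 4 (radix 16), `log_sum_row(4)` returns `(6, 3, 4)`: three tree levels, $a^b = 6$, $a_j = 3$, and k = 4 adders, matching the log-sum entries quoted in the text.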

3.6.2.1.3.3  Tree-structure using RBA-3s and RBA-2s - In this configuration,
3-input redundant binary adders (RBA-3) and RBA-2s are connected
in a tree structure.

An RBA-3 consists of two BUs, a D-element and a C-element arranged
as shown in Figure 3.12.  The C-element composes two binary inputs
{0,1; 0,1} into one redundant binary $\{\bar{1},0,1\}$ output.  The lower BU in
combination  with  the  C-element  and  the  D-element  acts  as  a  redundant 
binary  (3,2)  counter.   The  upper  BU  forms  the  sum  of  the  sum-outputs  of 
the  lower  BUs  and  the  'Transfer'  output  of  the  lower  BU  of  adjacent  less 
significant  RBA-3. 

For  a  design  of  a  (k+1)  input  MIRBA,  RBA-3s  are  used  whenever  they 
can  be  fully  utilized,  that  is,  three  inputs  are  available  for  addition; 
and  RBA-2s  are  used  when  only  2-inputs  are  to  be  added  at  any  level  of 
the  tree  structure.   (An  exception  occurs  for  k=3  where  the  log-sum  tree 
technique  is  necessary.)   Figure  3.13  shows  a  5-input  MIRBA  using  RBA-3s 
and  RBA-2s  as  building  blocks. 

The number of BUs required in this technique is also k for a (k+1)-input
MIRBA.  The number of BU levels is also $2 \lceil \log_2 (k+1) \rceil$.  Table 3.1 shows
the values of $a^b$ and $a_j$ for various values of k.  It shows that $a_j = 2$ for all
values of k except k=3.

Figure 3.12  Arithmetic Structure of an RBA-3.

The tree structure configurations described in 3.6.2.1.3.2 and
3.6.2.1.3.3 have the following advantages compared to Rohatsch's technique.

a)  They are more general and have the same configuration for any value
of k.

b)  They make use of only one kind of cell, that is, the Borovec Unit,
for the implementation of the MIRBA.

c)  The various BUs are uniformly and regularly interconnected.

Because of b) and c) above, this implementation meets the LSI constraints
of structural regularity and a minimum number of cell types.

In terms of our notation of Section 2.4,

$${}_{FMA}F_i = {}_{FMA}F_{i-1}, \qquad \forall i,\ 1 \le i \le n$$

$${}_{FMA}F_0 = m_j$$

where  ${}_{FMA}F_i$  =  modifier value which is sent by PE_i along with
                  microinstruction FMA to PE_{i+1},

       ${}_{FMA}F_0$  =  modifier value sent by MCU along with microinstruction
                  FMA to PE_1,

       $m_j$  =  Multiplier (or Quotient) digit.

$${}_{FMA}G_{i+1} = \langle t_i^p,\ t_i^A \rangle$$

where  $t_i^p$  =  Product Transfer to PE_i from PE_{i+1},
       $t_i^A$  =  MIRBA Output Transfer from PE_{i+1},

and    $a_{FMA} = 2$.

3.6.3  Multi-sum (MS) microinstruction - This microinstruction forms
the sum of N digit vectors, where N is the number of inputs of a MIRBA used
in the implementation of microinstruction FMA.  N depends on the digit
algorithm used for FMA.  In any case $N \le k+1$.

The digit level transfer function is given by

$$a'_i = x_i^1 + x_i^2 + \cdots + x_i^N - r\, t_{i-1}^A + t_i^A \tag{3.6.13}$$

where $a'_i,\ x_i \in \{\overline{(r-1)}, \overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-1)\}$.

If we designate the set of arithmetic transformations performed by a
Borovec Unit as a Borovec Unit Transformation (BAT), then the transfer function
in Equation (3.6.13) is realized by a series of $\lceil \log_2 N \rceil$ BATs.  This
was discussed earlier in the design of a MIRBA in Section 3.6.2.1.3.

Implementation 

The  MS  digit  algorithm  can  be  implemented  by  making  use  of  MIRBAs 
already  existing  in  the  digit  processing  logic  of  the  processing  element. 
This  is  shown  in  Figure  3.14. 

The MS microinstruction can be represented in the notation of Section
2.4 as follows.

$${}_{MS}F_i = {}_{MS}F_{i-1}, \qquad \forall i,\ 1 \le i \le n$$

$${}_{MS}F_0 = \langle \text{Null} \rangle$$

$${}_{MS}G_{i+1} = t_i^A$$

and

$$a_{MS} = \left\lceil \frac{a^b}{k} \right\rceil = \begin{cases} 2 & \text{for radix } 2^k < 32 \\ 1 & \text{for radix } 2^k > 32. \end{cases}$$


3.6.4  Normalize  Recode  (NR)  microinstruction  -  This  microinstruction 
is  used  to  normalize  an  operand  according  to  Definition  3  given  in 
Section  3.5.1. 

Given an operand of the form

$$X = .\, x_1\ x_2\ \ldots\ x_i\ \ldots\ x_j\ x_{j+1}\ \ldots\ x_n$$

such that

$$|x_i| > 0, \qquad 1 \le i \le j-1,$$

$$x_j = 0,$$

$$\text{and} \quad |x_i| \ge 0, \qquad j+1 \le i \le n,$$

the NR microinstruction transforms the operand X into an algebraically
equivalent operand X'

$$X' = .\, 0\, 0 \ldots 0\ x'_h\ x'_{h+1}\ \ldots\ x'_i\ \ldots\ x'_j\ x'_{j+1}\ \ldots\ x'_n$$

where  sign ($x'_i$) = sign ($x'_h$),  $h \le i \le j-1$,  $h \ge 1$,

       $x'_j = x_j = 0$

and    $x'_k = x_k$,  $j+1 \le k \le n$.

For example, if radix r=10, then the numbers $.1\bar{9}\bar{9}704$, $.1\bar{9}09704$, $.17\bar{9}\bar{4}1\bar{2}$
and .109018 would be recoded respectively as .001704, .0109704, .160608
and .109018.
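The effect of NR on the examples above can be reproduced with a short value-based sketch. It assumes the digits ahead of the first zero are re-expressed by evaluating them as one signed integer and writing that integer back in same-sign digits; this is a stand-in chosen for clarity, not the digit-serial procedure of Figure 3.15, and the function name is ours. Digits are listed most significant first, with negative integers standing for barred digits.

```python
def nr_recode(digits, radix=10):
    """Recode the digits ahead of the first zero digit into same-sign digits.

    Digits from the first zero onward are passed through unchanged.
    """
    try:
        j = digits.index(0)          # position of the first zero digit
    except ValueError:
        j = len(digits)              # no zero: the whole operand is recoded

    prefix_value = 0
    for d in digits[:j]:
        prefix_value = prefix_value * radix + d

    sign = -1 if prefix_value < 0 else 1
    magnitude = abs(prefix_value)

    recoded = []
    for _ in range(j):
        recoded.append(sign * (magnitude % radix))
        magnitude //= radix
    recoded.reverse()
    return recoded + list(digits[j:])
```

For instance, `nr_recode([1, -9, -9, 7, 0, 4])` returns `[0, 0, 1, 7, 0, 4]`, since 1000 - 900 - 90 + 7 = 17, and `nr_recode([1, 7, -9, -4, 1, -2])` returns `[1, 6, 0, 6, 0, 8]`.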


Digit Algorithm

Let

   S_OP   =  sign of the operand

   S_i    =  sign of digit x_i  (0 if +ve, 1 if -ve)

   r      =  radix

   |x_i|  =  magnitude of digit x_i

The digit algorithm is given by the flowchart of Figure 3.15.  Initially
S_OP is known and is equal to S_1.

In terms of the notation of Section 2.4,

$${}_{NR}F_i = \begin{cases} S_{OP}, & 1 \le i \le j-2 \\ \langle \text{Null} \rangle, & i > j-1 \end{cases}$$

where  S_OP = S_1 for i = 0, and

       j is the index of the first zero digit in the operand d-vector;

$${}_{NR}G_{i+1} = x_{i+1}, \qquad 1 \le i \le j-1$$

and    $a_{NR} = 1$.


In the actual implementation of the digit algorithm for NR, the ${}_{NR}G_{i+1}$
information consists of $S_{i+1}$ and $Z_{i+1}$, where $S_{i+1}$ is a bit carrying the
sign information of digit $x_{i+1}$ and $Z_{i+1}$ is also a bit, whose two states
indicate whether digit $x_{i+1}$ is zero or not.  (cf. Section 4.3.2.3.6.2)

Figure 3.15   Flowchart of the Digit Algorithm for
Microinstruction NR.

3.6.5  Assimilation Recode (AR) microinstruction - This microinstruction
is used to assimilate (convert) a signed-digit redundant operand into
an algebraically equivalent operand such that all the digits in the recoded
form are of the same sign as the sign of the original operand.

Given a signed-digit operand of the form

$$X = .\, x_1\ x_2\ \ldots\ x_i\ 0\ x_{i+2}\ \ldots\ x_k\ 0\,0 \ldots 0\ x_{k+\ell}\ \ldots\ x_n$$

such that

$$\text{sign}(x_{i+1}) = \text{sign}(x_{i+2}) \quad \text{and}$$

$$\text{sign}(x_{k+1}) = \text{sign}(x_{k+2}) = \cdots = \text{sign}(x_{k+\ell-1}) = \text{sign}(x_{k+\ell}),$$

that is, all the zero digits in the d-vector of number X have the same
sign as that of the first nonzero digit to the immediate right of a string
of zeros (of length $\ge 1$), the AR digit algorithm would recode X into X'

$$X' = .\, x'_1\ x'_2\ \ldots\ x'_i\ x'_{i+1}\ x'_{i+2}\ \ldots\ x'_k\ x'_{k+1}\ \ldots\ x'_{k+\ell}\ \ldots\ x'_n$$

such that

X = X' and
sign ($x'_i$) = sign ($x'_{i+1}$) = sign (X),  $\forall i$, $1 \le i \le n$.

Digit Algorithm

Let  S_OP     =  sign of the operand

     S_i      =  sign of the digit i

     S_{i+1}  =  sign of the digit i+1

     x_i      =  digit i of operand X

The digit algorithm is almost identical with that for the NR microinstruction,
except that the microinstruction acts on all the digits of
the operand.  It is given by the flowchart shown in Figure 3.16.

In the notation of Section 2.4,

$${}_{AR}F_i = S_{OP} = S_1, \qquad 1 \le i \le n$$

$${}_{AR}G_{i+1} = S_{i+1}, \qquad 1 \le i \le n$$

$$a_{AR} = \text{variable, depending on the d-vector}$$

(Flowchart labels that survive: pass the microinstruction to PE_{i+1}; sign
tests against S_OP with YES/NO branches; actions $|x_i| \leftarrow r - |x_i|$,
$|x_i| \leftarrow r - 1 - |x_i|$ and $|x_i| \leftarrow |x_i| - 1$.)

Figure 3.16   Flowchart of the Digit Algorithm for
Microinstruction AR.
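A digit-serial recoding of this kind can be sketched from the three magnitude updates that survive in the flowchart (radix complement, diminished radix complement, and subtraction of unity). The assignment of those updates to the sign tests below is our assumption, chosen because it preserves the value of the operand; it is not a transcription of Figure 3.16. Digits are listed most significant first, negative integers stand for barred digits, and the function name is ours.

```python
def ar_recode(digits, radix=10):
    """Recode a signed-digit operand so that every digit has the operand's sign.

    Assumptions: the operand sign S_OP is the sign of the leading nonzero
    digit; a zero digit takes the sign of the first nonzero digit to its
    right; zeros at the far right take S_OP.
    """
    n = len(digits)
    nonzero = [d for d in digits if d != 0]
    if not nonzero:
        return list(digits)
    s_op = 1 if nonzero[0] > 0 else -1

    # signs[i] is S_i; signs[n] stands for the (absent) digit to the right.
    signs = [0] * (n + 1)
    signs[n] = s_op
    for i in range(n - 1, -1, -1):
        if digits[i] > 0:
            signs[i] = 1
        elif digits[i] < 0:
            signs[i] = -1
        else:
            signs[i] = signs[i + 1]

    recoded = []
    for i in range(n):
        magnitude = abs(digits[i])
        same = signs[i] == s_op
        next_same = signs[i + 1] == s_op
        if same and next_same:
            new_magnitude = magnitude              # digit passes unchanged
        elif same:
            new_magnitude = magnitude - 1          # lend one unit to the right
        elif next_same:
            new_magnitude = radix - magnitude      # radix complement
        else:
            new_magnitude = radix - 1 - magnitude  # diminished radix complement
        recoded.append(s_op * new_magnitude)
    return recoded
```

For example, `ar_recode([1, 7, -9, -4, 1, -2])` returns `[1, 6, 0, 6, 0, 8]`, and `ar_recode([1, 0, -3])` returns `[0, 9, 7]`, since 100 - 3 = 97. Each digit needs only its own sign and the sign of its right neighbor, which is consistent with the G-information being a single sign.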
4.   LOGIC  DESIGN  OF  THE  PROCESSING  ELEMENT

4.1  Introduction 

In this chapter, the logic design of the Processing Element (PE) is
developed and discussed in detail.  The major components of the PE are
the Register File for the temporary storage of active operands, the Digit
Processing Logic (DPL), which is essentially a large combinational logic
circuit, and the Processing Element Control Logic (PCL), which supplies
the control signals in proper temporal order to condition the combinational
DPL to execute the various microinstructions.  The major considerations
in the logic design of the PE are the LSI technology constraints:
namely, the PE should require as few external pins as possible and
the logical organization of the PE should have structural uniformity and
regularity.

Section  4.2  discusses  the  logic  design  of  data  path  structure  of 
the  PE  and  in  Section  4.3  is  given  the  logical  organization  and  detailed 
design  of  the  control  algorithms  for  the  generation  of  control  signals. 
Finally,  Section  4.4  discusses  the  logic  complexity  of  the  DPL  and  the 
PE  control  logic  in  terms  of  the  number  of  gates  and  the  external  pins 
for  the  PE  module  as  a  function  of  the  bit  width  of  the  PE  module. 

4.2  Block  Diagram  Description  of  a  Processing  Element 

Figure 4.1 shows the schematic block diagram of a Processing
Element.  It consists of three main components: Digit Processing Logic
(DPL), Register File and Control.  The Register File comprises a set
of digit-wide registers which are used to hold the operand digits and
result digits.  The registers could also be used to hold intermediate
result digits temporarily.  Inter-register transfer microinstructions
operate on these registers.  The Digit Processing Logic is essentially
combinational processing logic (along with some storage for G-information)
and is used to process the microinstructions of the PE.  The DPL
operates on the operand digits stored in the register file of the PE
and G-information from its right neighboring PEs.  It also generates
the G-information for its left neighbor PEs.  The Control issues the
timing control signals to the processing logic for sequencing the
various steps of the digit algorithms for the microinstructions.  It
also coordinates the actions of the PEs by accepting the microinstructions
and G-information from neighboring PEs and by transmitting
the microinstructions to the right neighbor and generating G-information
for its left neighbor PE.

4.2.1  Register file - The register file is a set of registers that
are used to hold the operand and result digits.  Each PE retains one
digit of each of the active operands.  Each register is (k+1) bits long
to hold the k magnitude bits and one sign bit of one SM-encoded radix-$2^k$
digit.  There must be at least three registers in a PE:  an accumulator
register, a multiplier-quotient (MQ) register, and an operand register.
However, the multi-sum microinstruction MS requires N operands and hence
N storage registers, where N represents the number of operands the radix-$2^k$
adder in the DPL is capable of adding simultaneously.  It was shown
earlier in Sections 3.6.2.1.1 and 3.6.2.1.2 that the radix-$2^k$ adder adds
either (k+1) or 3 operand digits depending on how the algorithm for
microinstruction FMA is implemented.  For the present discussion, we shall
assume that the adder is capable of adding (k+1) operand digits simultaneously
and that k = 4.  The register file would thus contain at least five
registers.  Additional registers can be added to the register file.  One
possible use of such registers is to hold the intermediate results which
are needed so soon after they are calculated that storing them and retrieving
them from memory would unnecessarily delay the processing.  The
number of desirable intermediate result registers is determined by the
method of communicating between memory and the arithmetic unit, the
number of extra pins, if any, required for the identification address of
these registers and their contribution to the overall logic complexity
of the chip.  Figure 4.2 shows an internal register file containing five
registers INR1, INR2,...,INR5.  In this thesis, registers INR1, INR2 and
INR3 act respectively as Accumulator, operand register and MQ-register.
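The (k+1)-bit register format can be illustrated with a small encode/decode pair. The sketch below assumes that the sign bit sits above the k magnitude bits and that 0 denotes a positive digit and 1 a negative digit; the bit placement and the function names are assumptions made for illustration only.

```python
def sm_encode(digit, k):
    """Pack a signed radix-2**k digit into a (k+1)-bit sign-magnitude word."""
    magnitude = abs(digit)
    if magnitude >= 1 << k:
        raise ValueError("digit magnitude does not fit in k bits")
    sign_bit = 1 if digit < 0 else 0
    return (sign_bit << k) | magnitude


def sm_decode(word, k):
    """Recover the signed digit from a (k+1)-bit sign-magnitude word."""
    magnitude = word & ((1 << k) - 1)
    negative = (word >> k) & 1
    return -magnitude if negative else magnitude
```

With k = 4, the digit -5 is held as the five-bit word `0b10101`, and each of the five registers INR1 to INR5 of a PE would hold one such word.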

The registers in the register file are loaded from a buffer
register, IBR, whose contents are determined by the Internal Register
Input Bus Selector, sRIB, in the Digit Processing Logic (discussed
later in Section 4.2.2.7.3).  Similarly, the contents of the registers
are input to the DPL either directly or through an Output Bus Selector,
sROB, also in the DPL.  The control signals gINR[x], x = 1,2,...,5, for
loading the registers are provided by the local control in the PE.

Because of the bus mechanism for inputting and outputting data in
the register file, any whole operand register (consisting of corresponding
operand digit registers distributed in the various PEs) can be used
as a shift register, capable of shifting one digit at a time.  This is
made use of in microinstructions LS and RS.  R1, R2,...,R5 denote the
outputs (contents) of registers INR1, INR2,...,INR5, respectively.

4.2.2  Logic  design  of  digit  processing  logic 

4.2.2.1  Block diagram description of DPL - Figure 4.3 shows the
data flow structure of the Digit Processing Logic (DPL) in block diagram
form.  It consists of three major components: the Digit Product Generator,
DPG, a radix-$2^k$ multi-input adder, MIAD, and a Digit Sum Encoder, DSE.  In
addition to these three main components, there are selector networks sADR
for the adder input, sDSE for the Sum Encoder, sROB for selecting the output
of internal registers in the register file, sRIB for selecting the
inputs to the in-bus of the register file and sTOP for selecting the contents
of the 'Transfer' Output Port (TOP).  Besides, there are two registers
GIR and APR for storing the G-information and multiplicand digit from
the adjacent right neighbor PE.

The Digit Product Generator, DPG, forms the product array, in redundant
binary form, of two SM-encoded radix-$2^k$ digits $m_j$ and $\phi_i$.
The digit $\phi_i$ comes from the operand register INR2 via the output bus
selector sROB, and the multiplier digit $m_j$ is input to the DPG
from the microinstruction register MIR in the local control logic of the
PE.

The multi-input adder, MIAD, adds the $w_i$ columns of the redundant
binary product array formed by the DPG and the collective product transfer
$t_i^p$.  The MIAD is also used to add the two operand digits $a_i$ and $\phi_i$ from
the registers R1 and R2 for the microinstruction SS and to add the
operand digits for the microinstruction MS.  The radix-$2^k$ multi-input
adder is made up of k stages of MIRBAs.

The necessity of the register APR for the multiplicand digit from
the adjacent PE will be clear from the discussion in Section 4.2.2.5.

The  Digit  Sum  Encoder  DSE  converts  the  redundant  binary  sum  output 
of  adder  MIAD  to  the  SM  format  for  local  storage  in  the  accumulator 
register  INR1  of  the  register  file  or  transfer  out  of  the  PE.   The  DSE 
is  also  used  in  the  microinstructions  AR  and  NR  for  forming  the  radix 
and  diminished  radix  complement  of  the  magnitude  bits  of  the  accumulator 
register  INR1  and  also  for  subtracting  unity  from  the  magnitude  of  the 
accumulator  contents.   In  addition,  sDSE  and  DSE  are  made  use  of  in 
inter-register  transfer  microinstructions  TD  and  TI  for  direct  and 
reversed-sign  inter-register  transfer. 

The  Adder  Input  Selector  sADR  routes  appropriate  data  in  redundant 
binary  form  to  the  MIAD  inputs  depending  on  the  microinstruction  the 
PE  is  executing  at  that  time. 

The  selector  sDSE  selects  the  appropriate  input  to  the  encoder  DSE. 

Also shown in Figure 4.3 are input and output ports designated as TIP_i, RIP_i and TOP_i, ROP_i, respectively. The input ports TIP_i and RIP_i respectively carry the 'transfer' (carry or borrow) from the adjacent MIAD and the contents of some register in the register file of the adjacent PE_{i+1}. These ports essentially carry the 'G-information' from the adjacent PE_{i+1} for the microinstruction that is being executed by the present PE_i. The output ports TOP_i and ROP_i are, however, shared: they carry the 'G-information' for the left neighbor PE_{i-1} and also the address and data information, respectively, for the local operand memory PEM_i. This sharing is used for fetching data from and storing data to the PEM under the control of microinstructions LPM and SPM.

The selector sTOP selects either the 'transfer' information from the MIAD or the address bits and Read/Write bit for PEM_i.

The details of the logic design of the various blocks described above are discussed in the following sections. Since the logic complexity of the major components DPG, MIAD, DSE and sADR depends on the choice of logic vector encoding for the redundant binary digits, the three logic vector encodings considered for study are described first, followed by the logic design details of the major components.

4.2.2.2 Choice of logic vector encodings - As mentioned in Section 3.4.1, the redundant binary (RB_r) mode of encoding for a radix-r signed digit is used for the arithmetic processing. Each redundant binary digit requires 2 bits for representation. There are nine distinct ways, under permutation and negation [40], of assigning the three values ($\bar{1}$, 0, 1) to the four states of two binary logic variables, where $\bar{1}$ denotes $-1$. Of the nine ways, three encodings were chosen for this study because they are the simplest as far as the conversion from the SM_r mode to the chosen encoding for the RB_r mode is concerned. Let a radix-$2^k$ signed digit $x_i$, encoded in SM_r mode, be represented by a (k+1)-tuple $(S_i, x_{i_{k-1}}, x_{i_{k-2}}, \ldots, x_{i_0})$ such that

$$x_i = (-1)^{S_i} \sum_{j=0}^{k-1} x_{i_j} 2^j, \qquad x_{i_j} \in \{0,1\}.$$

The corresponding RB encoded form is given by

$$x_i = \sum_{j=0}^{k-1} x^*_{i_j} 2^j, \qquad x^*_{i_j} \in \{\bar{1},0,1\}.$$

Let the redundant binary digit $x^*_{i_j}$ be represented by a 2-tuple logic vector $(\chi_{i_j}, x_{i_j})$ where

$$\chi_{i_j},\ x_{i_j} \in \{0,1\}.$$

The three logic vector encodings for the redundant binary digit $x^*_{i_j}$ considered in this research are given in Table 4.1.


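Table 4.1  Logic Vector Encodings

  Binary 2-tuple logic vector      Encoded value $x^*_{i_j}$
  $(\chi_{i_j},\ x_{i_j})$          LVE1        LVE2        LVE3
  ---------------------------------------------------------------
        0   0                        0           0           0
        0   1                    $\bar{1}$       1           1
        1   0                        1          D.C.         0
        1   1                        0       $\bar{1}$   $\bar{1}$

  ($\bar{1}$ denotes $-1$; D.C. marks the disallowed, don't-care state.)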

The conversion from SM mode to the encoding format LVE3 is the simplest and is equivalent to attaching the sign $S_i$ of the SM-encoded radix-$2^k$ digit to each magnitude bit $x_{i_j}$ individually. The conversions for the three encodings are given below; on the right-hand sides, $S_i$ and $x_{i_j}$ denote the sign and magnitude bits of the SM-encoded digit.

LVE1: $x^*_{i_j} = \chi_{i_j} - x_{i_j}$

$$\chi_{i_j} = x_{i_j} \oplus S_i, \qquad x_{i_j} = S_i \tag{4.1}$$

where $\oplus$ stands for exclusive-OR, or

$$x_{i_j} = S_i \wedge x_{i_j}, \qquad \chi_{i_j} = \bar{S}_i \wedge x_{i_j} \tag{4.2}$$

LVE2: $x^*_{i_j} = x_{i_j} - 2\chi_{i_j}$, with $\chi_{i_j}\bar{x}_{i_j}$ disallowed

$$x_{i_j} = x_{i_j}, \qquad \chi_{i_j} = S_i \wedge x_{i_j} \tag{4.3}$$

LVE3: $x^*_{i_j} = (-1)^{\chi_{i_j}} x_{i_j}$

$$x_{i_j} = x_{i_j}, \qquad \chi_{i_j} = S_i \tag{4.4}$$

For the encodings LVE1 and LVE2, the conversion logic requires one exclusive-OR gate (Equation (4.1)) or two AND gates (Equation (4.2)), and one AND gate (Equation (4.3)), respectively, for each redundant binary digit. For the conversion of one radix-$2^k$ digit, there are k redundant binary digits.

Encoding LVE3 is essentially a sign and magnitude encoding of the redundant binary digit $x^*_{i_j}$ by the 2-tuple $(\chi_{i_j}, x_{i_j})$. Logic variables $\chi_{i_j}$ and $x_{i_j}$ respectively act as sign and magnitude bits. This encoding format will also be referred to, in subsequent discussion, as SM_b format, where subscript b indicates radix-2 (or binary).
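As a cross-check on Table 4.1 and Equations (4.1) to (4.4), the following Python sketch enumerates every SM sign/magnitude bit pair and confirms that each conversion yields a logic vector whose value is $(-1)^{S_i} x_{i_j}$. The function names are illustrative and not part of the original design. The pairing of the two lines of Equation (4.1) with $\chi$ and $x$ is an assumption of this sketch: it is the reading that reproduces $x^* = \chi - x$.

```python
from itertools import product

# Value of a redundant binary digit for a logic vector (chi, x), per Table 4.1.
def value_lve1(chi, x):
    return chi - x

def value_lve2(chi, x):
    # The state (1, 0) is disallowed under LVE2; None marks it.
    return None if (chi, x) == (1, 0) else x - 2 * chi

def value_lve3(chi, x):
    return (-1) ** chi * x

# SM -> RB conversions. s is the SM sign bit, b one SM magnitude bit.
# Each function returns the logic vector (chi, x).
def to_lve1_xor(s, b):          # reading of Equation (4.1): one exclusive-OR
    return (b ^ s, s)

def to_lve1_and(s, b):          # Equation (4.2): two AND gates
    return ((1 - s) & b, s & b)

def to_lve2(s, b):              # Equation (4.3): one AND gate
    return (s & b, b)

def to_lve3(s, b):              # Equation (4.4): no gates
    return (s, b)

def check_conversions():
    """Return True if every conversion preserves the signed bit value."""
    for s, b in product((0, 1), repeat=2):
        expected = (-1) ** s * b
        if value_lve1(*to_lve1_xor(s, b)) != expected:
            return False
        if value_lve1(*to_lve1_and(s, b)) != expected:
            return False
        if value_lve2(*to_lve2(s, b)) != expected:
            return False
        if value_lve3(*to_lve3(s, b)) != expected:
            return False
    return True
```

Note that the LVE2 conversion never produces the disallowed vector (1, 0), since $\chi = S_i \wedge x$ can be 1 only when $x$ is 1.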

4.2.2.3 Logic design of RBA-2 (BU) - Let $\ell^*_v$, $m^*_v$ denote the redundant binary inputs and $d^*_v$ denote the redundant binary output of an RBA-2. Further, let $\ell^*_v$, $m^*_v$ and $d^*_v$ be respectively represented by the logic variable pairs $(\lambda_v, \ell_v)$, $(\mu_v, m_v)$ and $(\delta_v, d_v)$. Also let $t^+_v$, $t^-_v$ and $t^+_{v-1}$, $t^-_{v-1}$ be the input and output 'Transfers' of the RBA-2, as shown in Figure 4.4. In the configuration shown in Figure 4.4, the RBA-2 is a cascade combination of a symmetric subtracter and a symmetric adder. Robertson [40] has given the logic equations for the symmetric subtracter and symmetric adder for all the nine distinct encodings referred to earlier. Using those results, the logic equations for the RBA-2 for the three logic vector encodings being considered here are given as follows:

LVE1:

$$\begin{aligned}
d_v &= \lambda_v \oplus \ell_v \oplus \mu_v \oplus m_v \oplus t^-_v \\
\delta_v &= t^+_v \\
t^-_{v-1} &= \ell_v m_v \vee \bar{\lambda}_v (\ell_v \vee m_v) \\
t^+_{v-1} &= w_v \mu_v \vee \bar{t}^-_v (w_v \vee \mu_v) \\
w_v &= \lambda_v \oplus \ell_v \oplus m_v
\end{aligned} \tag{4.5}$$

This is schematically represented in Figure 4.5. Each box in the figure essentially represents a full adder with a slightly modified carry function. Figure 4.6 shows the logic implementation. This implementation requires 22 two-input NAND gates, and the output digit $d^*_v$ is available after 12 gate delays. Figure 4.7 shows another logic implementation that requires 27 gates, but the output digit is available after only 9 gate delays. Note further that the logic in Figure 4.7 is no longer made of two identical logic substructures. The implementation of Figure 4.6 allows a simpler basic cell for LSI implementation of a MIRBA at the cost of a larger logic delay.
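The LVE1 equations (4.5) can be checked exhaustively. The Python sketch below models one RBA-2 cell and verifies the digit-level identity $\ell^*_v + m^*_v + (t^+_v - t^-_v) = 2(t^+_{v-1} - t^-_{v-1}) + d^*_v$ for all 64 input combinations. It assumes that each digit has the LVE1 value (Greek bit minus Latin bit) and that $\lambda_v$ in the $t^-_{v-1}$ term and $t^-_v$ in the $t^+_{v-1}$ term appear complemented. That placement of the complements is this sketch's reading of the printed equations, and the function names are illustrative.

```python
from itertools import product

def rba2_lve1(lam, l, mu, m, t_plus_in, t_minus_in):
    """One RBA-2 cell under LVE1; all arguments and results are single bits."""
    w = lam ^ l ^ m                                   # first (subtracter) stage sum
    t_minus_out = (l & m) | ((1 - lam) & (l | m))     # outgoing negative transfer
    t_plus_out = (w & mu) | ((1 - t_minus_in) & (w | mu))  # outgoing positive transfer
    d = w ^ mu ^ t_minus_in                           # Latin bit of the output digit
    delta = t_plus_in                                 # Greek bit of the output digit
    return delta, d, t_plus_out, t_minus_out

def check_rba2_lve1():
    """Return True if the cell preserves the arithmetic value for every input."""
    for lam, l, mu, m, tp, tm in product((0, 1), repeat=6):
        delta, d, tpo, tmo = rba2_lve1(lam, l, mu, m, tp, tm)
        inputs = (lam - l) + (mu - m) + (tp - tm)
        outputs = 2 * (tpo - tmo) + (delta - d)
        if inputs != outputs:
            return False
    return True
```

The outgoing transfers carry weight 2 relative to the digit position, which is why they are doubled on the output side of the identity.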


LVE2  : 


d   =  £   ©  m   ©  t+  ©  t" 
6   =  t+  (A   0  t"  ©  m  ) 


V      V 


t   ,   =  X     V  SL  y 

v-1       V      V      V 


(4.6) 


:,=t    (U©y)vum)vy   m   £   . 

v-1        V       V   ^    V       V   V       V    V    V 


The logic implementation of this adder using only 2-input NANDs is shown in Figure 4.8. Thirty-four two-input NAND gates are needed. The output digit is available 13 gate delays after the primary inputs $\ell^*_v$ and $m^*_v$ are stable, because the 'Transfer' input $t_v$ is available 7 gate delays after the inputs $\ell^*_{v+1}$ and $m^*_{v+1}$.

Figure 4.5  Schematic Functional Diagram of an RBA-2 using LVE1.

LVE3 :

d 

V 

~ 

I 

V 

6 

V 

= 

+ 
t 

V 

d        =      £       ©    m      ©     t+    ©     t~ 

V  V  v 

(A. 7) 


tn=X£vym£ 
v-1  v     v  v     v     v
v-1  V  vv  vv  v      v      \


The logic implementation of this adder using only 2-input NANDs is shown in Figure 4.9. This RBA-2 realization requires 26 gates, and the output $d^*_v$ is available 13 gate delays after the primary inputs of this RBA-2 and its adjacent RBA-2 are stable.

Note that the lower gate delay for the sum output of the RBA-2 using the LVE1 encoding is achieved because the logic variable $d_v$ is a function of $\ell^*_v$, $m^*_v$ and $t^-_v$ only. In the other two encodings, $d_v$ is dependent on $t^+_v$ also.

4.2.2.4 Logic design of a radix-$2^k$ multi-input adder (MIAD) - The radix-$2^k$ adder MIAD is used for two purposes: 1) to add the columns of the redundant binary product array formed by the DPG, and 2) to form the sum of the operand digits for microinstructions SS and MS. Figure 4.10 shows the schematic diagram of a radix-16 (k = 4) MIAD. It consists of 4 MIRBAs of five inputs each. Each MIRBA has two outputs: one, $a^*MF$, corresponding to the sum of all five inputs (used in microinstructions MS and FMA), and the other, $a^*SS$, corresponding to the sum of only the two inputs to the left-most BU in the bottom level of the tree of BUs and RBA-3s making up a MIRBA (used in microinstruction SS). The proper data is routed to the inputs of the MIRBAs by the adder input selector sADR. MATE and MATD are the encoders and decoders used for reducing the pins required for the 'Adder Transfers' $t^A_i$ and $t^A_{i-1}$.

In general, a radix-$2^k$ multi-input adder consists of a linear cascade of k MIRBAs. A (k+1)-input MIRBA is implemented as a tree structure of RBA-2s and/or RBA-3s. Each MIRBA requires k RBA-2s, which are arranged in $L = \lceil \log_2(k+1) \rceil$ levels. Therefore, the total number of gates required for the radix-$2^k$ adder is k times the gates needed for each MIRBA, and is equal to $k^2$ times the gates required by each RBA-2 (plus those required by the D- and C-elements of RBA-3s). Also, the total gate delay for the sum output is L times the delay of each RBA-2 (ignoring the extra delay necessary for control and inter-PE communication of the 'Adder Transfers' $t^A_i$ and $t^A_{i-1}$).
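The gate and delay estimates above are easy to tabulate. The following Python sketch applies them using the per-RBA-2 gate counts and delays quoted in Section 4.2.2.3 (the Figure 4.6 implementation for LVE1). It is a rough estimate only: it ignores the D- and C-elements of RBA-3s and the transfer communication delay, as the text does, and the function name is illustrative.

```python
# Two-input NAND gate counts and gate delays per RBA-2, as quoted in the text.
RBA2_GATES = {"LVE1": 22, "LVE2": 34, "LVE3": 26}
RBA2_DELAY = {"LVE1": 12, "LVE2": 13, "LVE3": 13}

def miad_estimate(k, encoding):
    """Return (levels, gates, delay) for a radix-2**k MIAD of k MIRBAs.

    levels = ceil(log2(k + 1)); for k >= 1 this equals k.bit_length().
    """
    levels = k.bit_length()
    gates = k * k * RBA2_GATES[encoding]      # k MIRBAs, each with k RBA-2s
    delay = levels * RBA2_DELAY[encoding]     # one RBA-2 delay per tree level
    return levels, gates, delay
```

For radix 16 (k = 4) this gives three tree levels, with 352 gates under LVE1, 544 under LVE2 and 416 under LVE3.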

For a (k+1)-input adder, the number of pins required for the input and output Adder Transfers $t^A_i$ and $t^A_{i-1}$ is 2k each, which is large for large values of k. One way to reduce the pins necessary for $t^A_i$ and $t^A_{i-1}$ is to encode the output 'Adder Transfer' into an algebraically equivalent value and to use a corresponding decoder to decode the input 'Adder Transfer' into the form required by the MIRBAs. The k RBA-2s of a MIRBA produce k 'positive transfers' and k 'negative transfers'. Thus the value of $t^A_i$, $t^A_{i-1}$ lies in the range -k to k, and this can be encoded into $\lceil \log_2(2k+1) \rceil$ bits (pins). However, the corresponding decoder is too complicated. A simpler design of the encoder, MATE, and decoder, MATD (shown dotted in Figure 4.10), results if the k positive and k negative 'Adder Transfers' are separately encoded into $\lceil \log_2(k+1) \rceil$ bits each. The corresponding decoder is simply a fan-out network such that the bit of weight w fans out to w input 'Adder Transfers' of the corresponding sign. The encoder MATE simply consists of two adder networks which form the sum of k bits each. Each adder network requires fewer than k full adders arranged in approximately $(\lceil \log_2 k \rceil + \lfloor \log_2 k \rfloor - 2)$ levels [41].

It should be noted that the decrease in the total pin requirement for both $t^A_i$ and $t^A_{i-1}$ together, from 4k to $4\lceil \log_2(k+1) \rceil$, is obtained at the cost of introducing a new logic cell (full adder) in the radix-$2^k$ adder design and also more delay in the generation of the output 'Adder Transfer'.
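The pin trade-off can be illustrated numerically. The Python sketch below compares the three options discussed above and models the encoder and fan-out decoder for transfers of one sign. All names are illustrative. The decoder is a simplification: it assumes each encoded bit drives its own separate group of lines, so it produces $2^b - 1$ lines for b encoded bits, and it does not model how the actual MATD maps these onto the k transfer inputs of a MIRBA.

```python
def ceil_log2(n):
    """ceil(log2(n)) for n >= 1, computed with integer arithmetic."""
    return (n - 1).bit_length()

def transfer_pins(k):
    """Total pins for the input and output adder transfers together.

    Returns (unencoded, jointly encoded, separately encoded).
    """
    unencoded = 4 * k                       # 2k pins in, 2k pins out
    joint = 2 * ceil_log2(2 * k + 1)        # one signed value in -k..k per direction
    separate = 4 * ceil_log2(k + 1)         # positive and negative counts, per direction
    return unencoded, joint, separate

def mate_encode(transfers):
    """Encoder sketch: count the active transfers of one sign."""
    return sum(transfers)

def matd_decode(count, width):
    """Decoder sketch: the bit of weight w fans out to w lines."""
    lines = []
    for bit in range(width):
        weight = 1 << bit
        lines.extend([(count >> bit) & 1] * weight)
    return lines
```

For k = 4, separate encoding needs 12 pins instead of 16, and the number of active decoded lines always equals the encoded count, which is the algebraic equivalence the text relies on.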

4.2.2.5 Logic design of digit product generator (DPG) - The Digit Product Generator forms the product array of two signed radix-$2^k$ digits. It accepts the two digits encoded in SM format and outputs the product array in redundant binary. The logic for the DPG consists of three parts (Figure 4.11a):

a) logic for generating the product magnitude digits,

b) logic for generating the sign of the product, and

c) logic for converting the magnitude digits to redundant binary form for input to the MIRBAs. The gates required for this logic are dependent on the logic vector encodings chosen for the redundant binary digits.

For the implementation of digit algorithm 1 of microinstruction FMA (Section 3.6.2.1.1), the logic for a) and b) consists of $k^2$ AND gates and one exclusive-OR gate, respectively. The conversion logic requires $k^2$ exclusive-OR gates (Equation (4.1)) or $2k^2$ AND gates (Equation (4.2)) for the encoding LVE1, $k^2$ AND gates (Equation (4.3)) for LVE2, and none for the encoding LVE3.

For the implementation of digit algorithm 2 of microinstruction FMA (Section 3.6.2.1.2), the logic for a) and b) consists of a ROM of bit capacity $2^{2k} \cdot 2k = k \cdot 2^{2k+1}$ and one exclusive-OR gate, respectively. The conversion logic, however, requires only 2k exclusive-OR gates (Equation (4.1)) or 4k AND gates (Equation (4.2)) for the encoding LVE1, 2k AND gates (Equation (4.3)) for LVE2, and none for the encoding LVE3.

Figure 4.11a  Schematic Diagram of Square Array DPG.

The pins contributed by the DPG to the pin complexity of the DPL are those pins which are required for the 'Collective Product Transfers' $t^P_{i-1}$ and $t^P_i$. If $t^P_{i-1}$ is generated in PE_i, then the pins needed for transmission of $t^P_{i-1}$ to the adjacent PE_{i-1} consist of one pin for the sign of $t^P_{i-1}$ and (k-1)+(k-2)+...+1 = k(k-1)/2 pins for the magnitude bits, assuming that the conversion to redundant binary form is done in PE_{i-1}. We shall call this method of generating $t^P_{i-1}$ and $t^P_i$ the 'Adjacent Generation' (AG) of the Collective Product Transfer (Figures 4.11b and 3.9).

These pins can, however, be reduced from k(k-1)/2 to only (k+1) if the CPT $t^P_{i-1}$ ($t^P_i$) is generated locally in PE_{i-1} (PE_i) itself, where it is needed. $t^P_i$ ($t^P_{i-1}$) is a function of the multiplicand digit $\phi_{i+1}$ ($\phi_i$) in PE_{i+1} (PE_i) and the multiplier digit $m$, the latter being the same in both PE_i and PE_{i+1} (PE_{i-1}). Thus PE_i (PE_{i-1}) needs to know only the multiplicand digit $\phi_{i+1}$ ($\phi_i$) in PE_{i+1} (PE_i) to generate $t^P_i$ ($t^P_{i-1}$), and this requires only (k+1) pins for the SM-encoded multiplicand digit $\phi_{i+1}$. We shall term this method of generating CPTs the 'Local Generation' (LG) of CPT. This is shown in Figure 4.11c. Figure 4.11d shows a DPG using 'Local Generation' of $t^P_i$. In the LG method of generating CPTs, the logic for the DPG requires one more exclusive-OR gate than for the AG method.

For algorithm 2 of FMA, where the DPG is implemented in ROM, the pins required for $t^P_{i-1}$ are only (k+1): one for the sign of the product and k for the magnitude bits of the product. This is shown in Figure 3.10. In the block diagram of the DPL shown in Figure 4.3, local generation of $t^P_i$ is assumed. The register APR is used to hold the multiplicand digit $\phi_{i+1}$ from the adjacent PE_{i+1}.

Figure 4.11d  Illustration of a Combination of an MIAD and DPG using 'Local Generation' of $t^P_i$.
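The pin and storage figures of this section can be tabulated for a given k. The short Python sketch below does so; the function names are illustrative, and the formulas are taken directly from the counts stated above.

```python
def cpt_pins_adjacent(k):
    """'Adjacent Generation': one sign pin plus k(k-1)/2 magnitude pins."""
    return 1 + k * (k - 1) // 2

def cpt_pins_local(k):
    """'Local Generation': the (k+1)-bit SM-encoded multiplicand digit."""
    return k + 1

def dpg_rom_bits(k):
    """ROM capacity for digit algorithm 2: 2**(2k) words of 2k bits each."""
    return (2 ** (2 * k)) * (2 * k)
```

For radix 16 (k = 4) the two methods need 7 and 5 pins respectively, and the ROM holds 2048 bits. The gap widens quickly with k, since the AG count grows quadratically while the LG count grows linearly.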

4.2.2.6 Logic design of digit sum encoder - The Digit Sum Encoder (DSE) transforms the redundant binary sum output of the radix-$2^k$ adder into an algebraically equivalent radix-$2^k$ sum digit in SM format for either local storage in the processing element or transmission out of the PE. The DSE is an iterative logic network and involves carry propagation. Its action can be described as a two-step process:

a) determination of the sign of the redundant binary sum digit and its conversion to an algebraically equivalent sum digit in 2's complement, and

b) conversion of the 2's complement form of the sum digit to SM format.

Figure 4.12a shows the DSE in block diagram form. Let the input and output sum digit $x_i$ be respectively given by (4.8) and (4.9):

$$x_i = \sum_{j=0}^{k-1} x^*_{i_j} 2^j, \qquad x^*_{i_j} \in \{\bar{1},0,1\} \tag{4.8}$$

$$x_i = (-1)^{S_i} \sum_{j=0}^{k-1} x_{i_j} 2^j, \qquad S_i,\ x_{i_j} \in \{0,1\} \tag{4.9}$$

where $x^*_{i_j}$ is represented by a 2-tuple logic vector $(\chi_{i_j}, x_{i_j})$ such that

$$\chi_{i_j},\ x_{i_j} \in \{0,1\}.$$

Figure 4.12a  Block Diagram of Digit Sum Encoder (DSE)

The Redundant Binary to Two's Complement (RBTC) logic (Figure 4.12b) converts the input $x_i$ to $y_i$ such that

$$y_i = -P_{i_k} 2^k + \sum_{j=0}^{k-1} y_{i_j} 2^j, \qquad y_{i_j} \in \{0,1\} \tag{4.10}$$

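The logic equations of the RBTC network for the three logic vector encodings of the input sum digit are given by

LVE3:

$$y_{i_j} = x_{i_j} \oplus P_{i_j}, \qquad j = 0,1,\ldots,k-1 \tag{4.11}$$

$$P_{i_{j+1}} = (\chi_{i_j} \wedge x_{i_j}) \vee (P_{i_j} \wedge \bar{x}_{i_j}) \tag{4.12}$$

$$P_{i_0} = 0$$

LVE2: Same as for LVE3.

LVE1:

$$y_{i_j} = \chi_{i_j} \oplus x_{i_j} \oplus P_{i_j} \tag{4.13}$$

$$P_{i_{j+1}} = (\bar{\chi}_{i_j} \wedge x_{i_j}) \vee (P_{i_j} \wedge \overline{(\chi_{i_j} \oplus x_{i_j})}) \tag{4.14}$$

$$P_{i_0} = 0$$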

The logic equations for the logic network TCSM (Figure 4.12c) that converts the 2's complement form $y_i$ to the corresponding SM format are independent of the logic vector encodings for the input sum digit. The logic equations are

$$x_{i_j} = y_{i_j} \oplus (\sigma_i \wedge Z_j) \tag{4.15}$$

$$Z_{j+1} = Z_j \vee y_{i_j} \tag{4.16}$$

$$\sigma_i = S_i = P_{i_k}$$

The signal $Z_{j+1}$, if equal to logical zero, implies that the binary digits $y_{i_j}, y_{i_{j-1}}, \ldots, y_{i_0}$ are all logical zero.
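The two-step action of the DSE can be simulated directly from Equations (4.11) to (4.16). The Python sketch below implements the RBTC ripple for LVE3 and LVE1 and the TCSM stage, then checks every redundant binary input of k digits against its arithmetic value. Bit lists are ordered least significant first. The function names are illustrative, and two points are assumptions of this sketch: the initial value $Z_0 = 0$, and the placement of the complements in the borrow terms, chosen so that the ripple behaves as a borrow chain.

```python
from itertools import product

def rbtc_lve3(chi, x, p0=0):
    """Equations (4.11)-(4.12). Returns (y bits, final borrow P_k)."""
    y, p = [], p0
    for c, b in zip(chi, x):
        y.append(b ^ p)
        p = (c & b) | (p & (1 - b))
    return y, p

def rbtc_lve1(chi, x, p0=0):
    """Equations (4.13)-(4.14); each digit has the value chi - x."""
    y, p = [], p0
    for c, b in zip(chi, x):
        y.append(c ^ b ^ p)
        p = ((1 - c) & b) | (p & (1 - (c ^ b)))
    return y, p

def tcsm(y, sigma, z0=0):
    """Equations (4.15)-(4.16); z0 = 0 is an assumption of this sketch."""
    out, z = [], z0
    for bit in y:
        out.append(bit ^ (sigma & z))
        z |= bit
    return out

def to_int(bits):
    return sum(bit << j for j, bit in enumerate(bits))

def check_dse(k):
    """Return True if sign and magnitude match the input value for all inputs."""
    for bits in product((0, 1), repeat=2 * k):
        chi, x = list(bits[:k]), list(bits[k:])
        cases = (
            (rbtc_lve3, sum((-1) ** c * b * 2 ** j
                            for j, (c, b) in enumerate(zip(chi, x)))),
            (rbtc_lve1, sum((c - b) * 2 ** j
                            for j, (c, b) in enumerate(zip(chi, x)))),
        )
        for rbtc, value in cases:
            y, sign = rbtc(chi, x)
            magnitude = to_int(tcsm(y, sign))
            if (-1) ** sign * magnitude != value:
                return False
    return True
```

The final borrow out of the RBTC ripple serves as the sign bit, so no separate sign-detection network is needed in this model.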

The Digit Sum Encoder (DSE) logic is also used to obtain the radix ($2^k$) complement and the diminished radix ($2^k - 1$) complement of the magnitude bits of the ROB input to the DSE via sDSE. Assuming logic vector encoding LVE3,

$$\chi_{i_j} = 1, \quad j = 0,1,\ldots,k-1; \qquad P_{i_0} = 0; \qquad \sigma_i = 0$$

will cause the radix complement of the magnitude bits to appear at the output of the DSE, whereas

$$\chi_{i_j} = 1, \quad j = 0,1,\ldots,k-1$$

will generate the diminished radix complement of the input magnitude bits. Similarly,

$$\chi_{i_j} = 0, \quad j = 0,1,\ldots,k-1; \qquad P_{i_0} = 1; \qquad \sigma_i = 0$$

will subtract unity from the value of the input magnitude bits.

These particular values of $\chi_{i_j}$, $P_{i_0}$ and $\sigma_i$ are made use of in the processing of microinstructions AR and NR, as described in Section 4.3.2.3.7.
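These special settings can be tried out on a small model of the DSE. The Python sketch below forces $\chi_{i_j}$, $P_{i_0}$ and $\sigma_i$ externally and applies the LVE3 RBTC ripple followed by the TCSM stage. The names are illustrative. One setting is an assumption rather than something stated in the surviving text: the diminished radix complement is modelled with $P_{i_0} = 1$ and $\sigma_i = 0$, on the reasoning that it equals the radix complement minus one.

```python
def dse_forced(chi, x, p0, sigma):
    """LVE3 RBTC followed by TCSM, with P_0 and sigma forced by control.

    chi and x are bit lists, least significant first; returns the k-bit
    output as an integer.
    """
    y, p = [], p0
    for c, b in zip(chi, x):
        y.append(b ^ p)
        p = (c & b) | (p & (1 - b))
    out, z = [], 0
    for bit in y:
        out.append(bit ^ (sigma & z))
        z |= bit
    return sum(bit << j for j, bit in enumerate(out))

def to_bits(n, k):
    return [(n >> j) & 1 for j in range(k)]

def radix_complement(n, k):
    return dse_forced([1] * k, to_bits(n, k), p0=0, sigma=0)

def diminished_radix_complement(n, k):
    # Assumption of this sketch: borrow-in of 1 with all sign bits set.
    return dse_forced([1] * k, to_bits(n, k), p0=1, sigma=0)

def decrement(n, k):
    return dse_forced([0] * k, to_bits(n, k), p0=1, sigma=0)
```

For k = 4 and an input of 5, the three functions return 11, 10 and 4 respectively, matching $2^4 - 5$, $2^4 - 1 - 5$ and $5 - 1$.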

However, in the case of the inter-register transfer microinstructions TD and TI, the magnitude bits at the input and output are to remain unchanged; the sign bit, $S_i$, is equal to the complement of $S_{ROB}$ for microinstruction TI, where $S_{ROB}$ denotes the sign of the digit on the bus ROB.

From Figures 4.12(b) and 4.12(c), we see that RBTC and TCSM each consist of k stages of identical logic cells. Each cell requires four 2-input NAND gates and one exclusive-OR (EX) gate. An EX-gate is equivalent to four 2-input NAND gates. Therefore the total number of gates, $G_{DSE}$, required by the DSE logic using logic vector encoding LVE3 or LVE2 is given by

$$G_{DSE} = 16k + C_1 \tag{4.17}$$

where $C_1$ is a constant and gives the gates necessary for the generation of $P_{i_0}$ and $\sigma_i$ under various control signal conditions. Use of logic vector encoding LVE1 will raise $G_{DSE}$ to $26k + C_1$.

In the remainder of this chapter, we shall assume only the sign-magnitude logic vector encoding LVE3 for the redundant binary digit because

a) conversion from sign-magnitude format SM to RB format is the simplest for logic vector encoding LVE3 for the redundant binary digit, as shown in Equation (4.4), and

b) the number of gates required for the logic implementation of the Digit Sum Encoder, DSE, is less in the case of the encoding LVE3 than in that of LVE1, whereas the gates required for an RBA-2 are comparable for both the encodings LVE1 and LVE3. The encoding LVE2 is too expensive for the logic implementation of an RBA-2, and hence of the MIAD, which is the major consumer of gates in the DPL.

4.2.2.7 Logic design of selector networks - Since the adder MIAD and the Digit Sum Encoder, DSE, are shared by more than one microinstruction, selector networks are needed in order to route appropriate data to the inputs of these processing logics. These selector networks also reformat the data, if necessary. Besides the selector networks sADR and sDSE, two more selector networks, sROB and sRIB, are necessary for transferring data out of and into the various registers of the register file. In addition, selector sTOP is used to choose the contents of output port TOP. In the following discussion, logic vector encoding LVE3 for the redundant binary digit is assumed.


4.2.2.7.1 Logic design of adder input selector (sADR) - Selector sADR accepts inputs from two sources: 1) the 'Product Array' $w_i$ and the 'Collective Product Transfer' array $t^P_i$, along with their corresponding signs $Sw_i$ and $St_i$, from the output of the DPG, and 2) the contents R2,...,R5 of the internal registers INR2,...,INR5 of the register file. These inputs are in sign-magnitude format, SM. Depending on the microinstruction, the sADR directs the appropriate data, reformatted in redundant binary form, to the inputs of the MIAD. For the microinstruction FMA, the outputs of the DPG are routed so that the redundant binary elements of the 'product' and 'transfer' arrays are added by MIRBAs of appropriate weights. In the case of microinstruction SS, only the contents R2 of register INR2 are input to the adder, and for microinstruction MS, the contents of one or more of the four registers INR2,...,INR5 are directed to the input of the adder. The contents R1 of the accumulator register INR1 are input directly to the adder. The logic networks for the selector sADR for radix 16 (k = 4) are shown in Figures 4.13a and 4.13b. Figure 4.13a shows the selector for the magnitude bits, and Figure 4.13b shows the generation of appropriate sign bits for the redundant binary adder inputs. $SR_j$ (j=1,...,5) indicates the sign bit of inputs R1,...,R5.

The control signals RjsADR (j=2,3,4,5) and SWTsADR are provided by the local control logic in the PE. Since the selector networks have no memory and the data at the input of the adder MIAD must be continuously available throughout the processing of microinstructions SS, MS and FMA, the selector control signals are permanently tied to the appropriate outputs of the microinstruction decoder.

For any radix-$2^k$, and assuming that the adder MIAD is made up of k stages of (k+1)-input MIRBAs, the gates required for the magnitude and sign bits (using logic vector encoding LVE3 for redundant binary) are respectively $3k^2$ and (3k+1). Denoting by $G_{sADR}$ the total number of gates for selector sADR for a radix-$2^k$ PE, we have

$$G_{sADR} = 3k^2 + (3k+1) = 3k^2 + 3k + 1 \tag{4.18}$$

Figure 4.13a  Logic Implementation of Selector sADR for Magnitude Bits.

Figure 4.13b  Logic Implementation of Selector sADR for Sign Bits.


4.2.2.7.2 Logic design of digit sum encoder selector (sDSE) - The selector sDSE (shown in Figure 4.14) accepts inputs from two sources: the two outputs $a^*SS_j$ and $a^*MF_j$ (j=0,1,...,k-1) of the adder MIAD, and the bus $ROB_j$ (j=0,1,...,k-1). The control signals ASSsDSE, AMFsDSE and ROBsDSE respectively select the MIAD output $a^*SS_j$ (corresponding to microinstruction SS), the MIAD output $a^*MF_j$ (corresponding to microinstructions FMA and MS) and the bus ROB. The control signal SCHI appropriately sets the sign bit $\chi_{i_j}$ (j=0,1,...,k-1) of the redundant binary input $x^*_{i_j}$ of the DSE to achieve the radix complement, the diminished radix complement or the direct transfer of the magnitude bits of ROB. This was explained earlier in Section 4.2.2.6. Figure 4.14 shows the logic implementation of sDSE for radix-16. Since the selector sDSE has no memory, the appropriate control signals must be held active throughout the processing of the microinstruction.

For any radix-$2^k$, the total number of NAND gates, $G_{sDSE}$, required for the logic of sDSE is given by

$$G_{sDSE} = 7k. \tag{4.19}$$

4.2.2.7.3 Logic design of selectors sRIB, sROB and sTOP - The selector sRIB is a three-input multiplexer with three input sources: the DSE, the APR and the digit field of the microinstruction register MIR in the control logic of the PE. The selector is one digit wide, and the number of NAND gates required for the logic implementation of this selector, shown in Figure 4.15, is given by $G_{sRIB}$, where

$$G_{sRIB} = 4(k+1). \tag{4.20}$$

Figure 4.14  Logic Implementation of Selector sDSE.

Figure 4.15  Logic Implementation of Selector sRIB.

The control signals DSEsRIB, APRsRIB and MIRsRIB respectively select the sources DSE, APR and MIR. The path from MIR to sRIB is made use of in the processing of microinstructions RS and LPM.

The width of selector sTOP (Figure 4.16) is equal to the width of the output port TOP_i. The width of TOP_i is determined by the number (= k+1) of bits required for the address space of PEM_i, plus one more for the Read/Write function of PEM_i, and the bits, $P_{t^A_{i-1}}$, required for the 'Adder Transfer' $t^A_{i-1}$, which depends on the method of encoding used for $t^A_{i-1}$. Assuming the width of TOP_i to be b, b is given by

$$b = \max(k+2,\ P_{t^A_{i-1}}).$$

For MIADs using encoders and/or redundancy ratio $\delta \le 2/3$, k+2 is greater than $P_{t^A_{i-1}}$. Therefore the number of gates, $G_{sTOP}$, required for the logic implementation of selector sTOP is

$$G_{sTOP} = 3(k+2). \tag{4.21}$$

The selector sROB selects the contents of one of the registers of the register file onto the register file output bus ROB. The gates required for this network depend on the number of registers in the register file and the bit width of the registers. For radix-$2^k$, the register width is (k+1) bits, and assuming (k+1) registers in the register file, the total number of gates required is

$$G_{sROB} = (k+1)(k+2). \tag{4.22}$$

Figure 4.17 shows the logic implementation of sROB for radix-$2^k$ (k=4).

Figure 4.16  Logic Implementation of Selector sTOP.

Figure 4.17  Logic Implementation of Selector sROB.

Note  that  these  selectors  have  no  memory.  The  control  signals 
would  have  to  remain  active  throughout  the  processing  of  a  micro- 
instruction. Although  the  selectors  are  shown  to  have  separate  control 
signals,  fewer  control  signals  with  a  local  decoder  would  suffice.   But 
the  separate  control  signals  are  shown  for  the  ease  of  exposition  because 
the  separate  names  of  the  signals  help  to  identify  the  sources  easily. 
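The selector gate counts of Equations (4.18) to (4.22) can be collected into one small function for a given k. The Python sketch below does this; the function name is illustrative, and the sTOP and sROB entries carry the same assumptions as the text (k+2 not smaller than the adder transfer width, and k+1 registers in the register file).

```python
def selector_gates(k):
    """Gate counts of the selector networks for a radix-2**k PE."""
    return {
        "sADR": 3 * k * k + 3 * k + 1,   # Equation (4.18)
        "sDSE": 7 * k,                   # Equation (4.19)
        "sRIB": 4 * (k + 1),             # Equation (4.20)
        "sTOP": 3 * (k + 2),             # Equation (4.21)
        "sROB": (k + 1) * (k + 2),       # Equation (4.22)
    }

def total_selector_gates(k):
    return sum(selector_gates(k).values())
```

For radix 16 (k = 4) the individual counts are 61, 28, 20, 18 and 30 gates, for a total of 157. Only sADR and sROB grow quadratically with k.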

4.2.2.7.4 Storage buffer registers of DPL - In addition to the combinational logic for processing and the selector networks for proper data routing, the DPL has three buffer registers: GIR, APR and IBR. GIR and APR are used to hold the G-information from the adjacent PE. The register GIR holds the 'Adder Transfer' $t^A_i$ from the adjacent PE_{i+1}, and the register APR stores the multiplicand digit $\phi_{i+1}$ for the local generation of the 'Product Transfer' $t^P_i$ in the DPG. The width of APR is (k+1) bits for radix-$2^k$, and the width of GIR depends upon the bit requirements of $t^A_i$. At the maximum, it is equal to 2k if no encoder MATE is used in the design of the adder MIAD. However, if the number of inputs to the MIRBAs is reduced, either by changing the redundancy of the multiplier digit or by any other means, then the bit width of GIR would be correspondingly reduced. Assuming that the 'encoders' and 'decoders' MATE and MATD are not used for $t^A_i$ and $t^A_{i-1}$, and that the number of inputs to each MIRBA is k' for a radix-$2^k$ adder, the bit width of GIR is 2(k'-1).

The  register  APR  is  also  used  in  the  processing  of  left  shift 

microinstruction  LS  and  is  used  to  hold  the  shifted  digit  from  the 

right  neighbor  PE.  -  temporarily  before  being  stored  in  the  register 

file  via  the  selector  sRIB. 
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Since  the  outputs  of  internal  registers  of  the  register  file  are 
directly  and  permanently  connected  to  the  input  of  combinational 
processing  logic,  it  is  necessary  to  provide  a  buffer  register,  IBR  at 
the  output  of  selector  sRIB.   The  output  of  the  selector  sRIB  is  gated 
into  the  register  IBR  and  thus  isolates  the  input  bus  of  the  registers 
from  any  changes  which  might  occur  due  to  feedback  through  the  combina- 
tional logic  when  the  contents  of  the  buffer  register  IBR  (i.e.,  the 
result  digit)  are  transferred  to  the  appropriate  register  in  the 
register  file. 

The  bit-width  of  the  register  IBR  is  (k+1)  for  a  radix-2  digit. 

4.3  Design  of  PE  Control 

The  processing  of  a  microinstruction  in  the  PE  requires  the  activa- 
tion of  the  various  data  paths  and  the  conditioning  of  combinational 
transformation  logic  of  the  DPL,  in  a  certain  temporal  order  depending 
on  the  nature  of  the  microinstruction.   These  time  ordered  activation 
control  signals  are  generated  by  the  PE  Control  Logic  (PCL)  which  is 
locally  resident  in  the  PE. 

Another  function  of  the  local  PE  control  is  to  coordinate  the 
actions  of  the  PEs  not  only  to  obtain  'G'  information  from  adjacent 
PEs  for  the  processing  of  the  microinstruction  but  also  to  receive 
and  transmit  the  microinstructions  from  and  to  the  adjacent  PEs.   The 
latter  is  necessary  to  process  a  'machine  instruction'.   Each  PE 
executes  the  same  sequence  of  microinstructions  which  is  issued  by  MCU 
depending  on  the  'machine  instruction'  to  be  processed  by  the  Arithmetic 
Unit and the specific operand values. After executing microinstruction
j-1, a typical PE, $PE_i$ say, must determine the value of ${}_jG_i$ and
inform $PE_{i-1}$ of its availability. ${}_jG_i$ is needed by $PE_{i-1}$ to
execute microinstruction j. $PE_i$ also passes the jth microinstruction and
modifier to $PE_{i+1}$ so that $PE_{i+1}$ will determine ${}_jG_{i+1}$ in
cooperation with $PE_{i+2}$, if necessary. When $PE_i$ receives
${}_jG_{i+1}$, it performs the microinstruction j and begins the procedure
for microinstruction j+1.

The  control  strategy  for  implementing  the  coordination  of  the 
various  PEs  can  be  either  synchronous  or  asynchronous.   In  the  former 
case,  all  the  PEs  act  in  synchronism  with  some  central  clock  whereas 
in  the  asynchronous  case,  all  the  activities  are  controlled  by  request- 
response  signals.   In  this  paper,  asynchronous  control  with  request- 
response  signals  is  chosen  because  of  the  following  advantages: 

a)  It  avoids  the  clock-skew  problems  when  a  large  number  of  PEs 
are  concatenated  together  for  high  precision  of  arithmetic. 

b)  Due  to  the  pipeline  nature  of  processing,  different  PEs  at  any 
instant  are  executing  different  microinstructions  which  take  different 
times to execute. The request-response strategy provides a better
overall average speed of processing.

c)  The  asynchronous  control  is  compatible  with  the  'localized' 
nature  of  processing  and  an  autonomous  and  modular  arithmetic  element. 

However,  it  does  have  the  disadvantage  of  increasing  the  number 
of  pins  required  for  the  PCL. 




4.3.1  Logical  organization  of  PE  control  -  The  PE  control  is 
organized  as  a  set  of  six  interacting  subcontrols  some  of  which  are 
active  concurrently  while  others  are  activated  in  sequence,  depending  on 
the  nature  of  the  control  algorithm  for  the  microinstruction.   Concurrently 
interacting  controls  allow  an  average  speed  up  in  the  processing  of  micro- 
instructions by  allowing  independent  operations  to  take  place  in  parallel. 

Figure  4.18  shows  the  various  subcontrols  and  their  interaction. 
The  division  of  PE  control  into  subcontrols  is  based  on  a  functional 
grouping  of  the  various  steps  in  the  control  flow.   The  various  sub- 
controls  are  R-control,  T-control,  G-control,  E-control,  F-control  and 
DM-control. 

The  Decode  and  Main  or  DM-control  is  the  main  control  which  super- 
vises and  coordinates  the  actions  of  other  subcontrols.   It  handles  the 
decoding  of  the  microinstruction,  sets  up  the  necessary  data  paths  in 
DPL,  and  then  chooses  the  proper  subcontrols  and  their  temporal  order 
for  the  execution  of  the  control  algorithm  of  the  microinstruction.   (In 
a  crude  software  analogy,  the  DM-control  can  be  considered  as  the  Main 
procedure  and  other  subcontrols  which  are  invoked  by  DM-control  as 
subroutines.) 

The Receive or R-control and the Transmit or T-control are the
primary controls for the coordination of PEs. R-control is concerned
with accepting the microinstruction from the left neighbor $PE_{i-1}$ and
acknowledging the receipt of the microinstruction (OP-code $\mu_j$ and the
modifier field ${}_jF_i$). The T-control transmits the received
microinstruction with the same or a new modified F-field ${}_jF_{i+1}$,
depending on the nature of the microinstruction, to $PE_{i+1}$.

The G-control and E-control together can be considered as
constituting the main processing controls for the microinstruction. The
G-control generates the G-information for the left neighbor $PE_{i-1}$ and
accepts the G-information from the right neighbor $PE_{i+1}$. The Execute
or E-control activates the necessary control signals to the combinational
logic to calculate the result digit and gate it into the appropriate internal
register of the register file. In addition to this, the status of the
digit in the accumulator register is set. The status checking involves
determining the sign and magnitude of the digit. If the accumulator
digit is zero, the sign of the digit is considered to be unknown.

The F-control is used when a new value of the modifier field,
different from that received, has to be sent to the right neighbor
$PE_{i+1}$. It is made use of in the right shift microinstruction RS.

4.3.1.1  Global  description  of  interaction  of  subcontrols  - 
Figure  4.18  shows  the  interaction  of  the  various  subcontrols.   It 
should  be  noted  that  Figure  4.18  does  not  show  the  hierarchical  order 
in  which  the  various  subcontrols  are  invoked  by  DM-control  but  only 
shows  a  gross  overview  of  the  interaction.   The  specific  temporal  order 
of  the  various  subcontrols  in  the  control  sequence  of  any  microinstruc- 
tion is  discussed  later  in  Section  4.3.2.3.3. 

The control sequence for every microinstruction begins in the
R-control. The R-control, on receiving a go-ahead signal from DM-control to
accept another microinstruction from the left neighbor $PE_{i-1}$, accepts
the microinstruction and acknowledges its receipt. It also invokes the
DM-control. The DM-control decodes the microinstruction, sets up the data
paths in the DPL and invokes one or more of the F, G, T, and E controls
depending on the microinstruction type. The F-control makes the changes in
the modifier field of the microinstruction and calls on the T-control to
transmit the modified microinstruction to $PE_{i+1}$. F-control is invoked
only for the right shift microinstruction RS. If the processing of the
microinstruction requires G-information, the G-control and T-control are
invoked in parallel. The G-control can be conceptually considered as
comprising two subcontrols: $G_{gn}$-control, which generates G-information
for the microinstruction executing in the adjacent $PE_{i-1}$, and
$G_{ap}$-control, which accepts G-information from the right neighbor
$PE_{i+1}$. (In the case where G-information depends logically on two or
more right neighboring PEs (e.g., microinstructions FMA, AR), the
subcontrols $G_{gn}$-control and $G_{ap}$-control interact with each
other.) After the necessary G-information for the execution of the
microinstruction has been obtained, the $G_{ap}$-control branches to
E-control for the execution of the microinstruction.

Figure 4.18  Logic Organization of PE Control Signal Generator.
(Legend: INVOKE; RETURN (REPLY).)

In those cases when G-control is not invoked by DM-control because
no G-information is needed from adjacent neighbors (e.g.,
microinstructions TD, TI, LDC), the DM-control directly calls upon the E
and T controls in parallel. The T-control transfers the microinstruction
to the right neighbor $PE_{i+1}$.

As the various invoked subcontrols finish their sequences of operations,
they report back to the DM-control. When all the invoked subcontrols
are finished, the DM-control replies back to the R-control, which was
suspended earlier from accepting any more microinstructions. The R-control
is now ready again to accept another microinstruction and the control
sequence begins again.

4.3.2  Logic design of PE control

4.3.2.1  Block  diagram  description  of  PE  control  logic  (PCL)  - 

Figure 4.19 shows the major components of the PCL in block diagram
form. It consists of a microinstruction register MIR, the selector
network sMIR, the 'Zero magnitude and Sign Detector' ZSD, and the timing
control signal generator TCS. The register MIR is 11 bits wide and
is used to hold the microinstruction, received on the microinstruction
input port $MIP_i$ from the adjacent $PE_{i-1}$, during processing by
$PE_i$. The selector sMIR is a two-way multiplexer which chooses either MIP
or ROB from the DPL as the appropriate source of data for the bits <4:0>
of MIR. The ZSD is a combinational logic block which monitors the sign and
magnitude bits of the accumulator register INR1. It sets flip-flop
$Z_i$ to logical state '1' if the magnitude of the accumulator in $PE_i$ is
zero. Flip-flop $S_i$ is set to the state of the sign bit SRI of the
accumulator register. The TCS generates the timing signals for the
activation of data paths and processing logic in the DPL and for the
coordination of the adjacent PEs.
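
The behavior of the ZSD block described above can be illustrated with a small sketch. It is not the thesis's logic design: the function name `zsd` and the digit layout (sign in the most significant bit, k magnitude bits below it) are assumptions made only for illustration, since the text does not give the bit position of the sign.

```python
def zsd(digit: int, k: int = 4):
    """Return (Z, S) for a (k+1)-bit accumulator digit.

    Assumed layout (not stated in the text): bit k is the sign bit,
    bits k-1..0 are the magnitude.
    """
    magnitude = digit & ((1 << k) - 1)   # low k bits
    sign = (digit >> k) & 1              # assumed sign position
    z = 1 if magnitude == 0 else 0       # Z = 1 when the magnitude is zero
    return z, sign


print(zsd(0b00000))  # zero digit: Z is set
print(zsd(0b10011))  # nonzero digit with the sign bit set
```

Note that when Z is 1 the S value carries no meaning, in line with the statement that the sign of a zero digit is considered unknown.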

The  generation  of  the  appropriate  control  signals  and  their  temporal 
order  depends  on  the  microinstruction — its  digit  algorithm  and  the  data 
flow  structure  of  DPL. 

4.3.2.2  Design and description of microinstruction formats - The
major considerations in the design of the various microinstruction formats
are:

a)  the bit width of the microinstruction should be as small as
possible, to minimize the number of pins required for the input port MIP, and

b)  the  microinstructions  should  be  powerful  so  that  they  take  full 
advantage  of  the  data  flow  structure  of  the  DPL  and  facilitate  the  micro- 
programming of  the  'machine  instruction'. 

These  two  aims  are  conflicting  in  nature  because  b)  requires  a  large 
instruction width. A compromise was achieved by using a varying number of
bits for the OP-code of the microinstruction.

Basically, each microinstruction has an OP-code field $\mu_j$ and a
modifier field ${}_jF_i$, as was discussed in Section 2.6. The basic
OP-code field is 3 bits long and the width of the modifier field depends on
the bit width (radix of arithmetic processing) of the PE and the number of
addressable registers in the register file. The modifier field ${}_jF_i$ is
further divided into two subfields: one field carries the address of the
register in the register file of the PE and the other field carries either
a digit or the address of the PEM location in the local operand mantissa
memory. For some microinstructions, these fields are used for other
purposes.

Figure 4.20 shows the specific OP-code bit assignment and the
formats for the various microinstructions. In this figure, it is assumed
that the bit width of the PE is 5 bits (that is, the radix is 16) and that
there are 5 (=k+1) registers in the register file of the DPL. The
microinstructions LPM, SPM, RS, LS and LDC have three-bit OP-codes whereas
microinstructions TD and TI have four-bit OP-codes. The OP-codes for
microinstructions SS, MS and FMA are six bits long whereas for AR and NR
they are seven bits long. The varying length of the OP-code allows a
basic three-bit field for OP-codes; otherwise, a straightforward coding
of 12 microinstructions would have required four-bit OP-codes.

Figure 4.20  Microinstruction Codes and Formats.
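
As a side check on the OP-code lengths listed above, the sketch below evaluates the Kraft sum for those lengths. A sum of at most 1 is the standard condition for a set of code lengths to admit an unambiguous prefix-free assignment. This is only a feasibility check on the lengths; the actual bit patterns are those of Figure 4.20, which is not reproduced here, and the text does not state that the thesis's codes form a prefix code. The names `OPCODE_LENGTHS` and `kraft_sum` are illustrative.

```python
from fractions import Fraction

# OP-code lengths in bits, as listed in the text.
OPCODE_LENGTHS = {
    "LPM": 3, "SPM": 3, "RS": 3, "LS": 3, "LDC": 3,
    "TD": 4, "TI": 4,
    "SS": 6, "MS": 6, "FMA": 6,
    "AR": 7, "NR": 7,
}


def kraft_sum(lengths):
    """Sum of 2^-n over all code lengths n, kept exact with Fraction."""
    return sum(Fraction(1, 2 ** n) for n in lengths)


total = kraft_sum(OPCODE_LENGTHS.values())
print(len(OPCODE_LENGTHS), total, total <= 1)  # 12 13/16 True
```

The five three-bit codes use five of the eight three-bit patterns, and the remaining seven longer codes fit within the patterns left over.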

It  should  be  noted  that  the  use  of  a  more  restricted  set  of  micro- 
instructions could  have  reduced  the  bit  width  of  the  microinstructions 
at  the  cost  of  less  flexibility  in  microprogramming  capability. 

In general, for a radix-$2^k$ arithmetic structure, the bit width of
a PE digit is (k+1), and assuming the register file to consist of (k+1)
registers, the bits required for a microinstruction are given by

$$
\begin{aligned}
I_b &= \text{instruction width in bits} \\
    &= 3 + \lceil \log_2(k+1) \rceil + (k+1) \\
    &= k + \lceil \log_2(k+1) \rceil + 4. \qquad (4.23)
\end{aligned}
$$
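
Equation (4.23) can be evaluated directly, as in the short sketch below. The function name `instruction_width` is illustrative. For k=4 (radix 16) it gives 11 bits, which agrees with the 11-bit width of register MIR stated in Section 4.3.2.1.

```python
from math import ceil, log2

OPCODE_BITS = 3  # basic OP-code field width given in the text


def instruction_width(k: int) -> int:
    """Bits per microinstruction for a radix-2^k PE, per Equation (4.23)."""
    digit_bits = k + 1               # width of one PE digit
    addr_bits = ceil(log2(k + 1))    # address one of (k+1) registers
    return OPCODE_BITS + addr_bits + digit_bits


for k in (1, 2, 4, 8):
    print(k, instruction_width(k))
```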

A  description  of  the  various  microinstructions  was  given  earlier 
in  Section  2.6.  The  function  of  each  microinstruction,  briefly,  is 
again  given  below. 

The memory access microinstructions LPM and SPM are respectively
used to fetch data from and store data to the processing element memory
PEM associated with the PE. The microinstruction field A2 gives the
location in PEM and field A1 identifies the register in the register
file.

In the shift microinstructions RS and LS, A1 identifies the
register to be shifted. The field D1 carries the digit from the
register in the left adjacent PE and D2 identifies the digit which must be
loaded in the register of the least significant PE. This facility is
made use of in multiplication, where the digit shifted out of the most
significant PE during the left shift of partial products has to be saved in
the least significant digit position of the multiplier operand register.

The field A1 in microinstruction LDC identifies the register to be
loaded with the digit given in field D1.

In microinstructions TD and TI, the fields A1 and A2 respectively
identify the source and destination registers in the register file. Note
that A2 can be equal to A1 in microinstruction TI, whereas such a condition
in TD is meaningless.

In  the  case  of  arithmetic  instruction  SS,  no  special  registers  are 
identified  because  this  microinstruction  always  causes  the  contents  of 
accumulator  register  INR1  and  operand  register  INR2  to  be  added  with 
the  result  going  to  the  accumulator  register. 

For  microinstruction  MS,  field  A3  identifies  the  various  registers 
of  the  register  file  whose  contents  would  be  added  by  microinstruction 
MS.   Note  that  the  address  in  A3  is  not  encoded  but  rather  each  bit  of 
A3  identifies  a  register.   A  bit  value  of  '1'  in  A3  indicates  that  the 
corresponding  file  register  would  take  part  as  the  source  of  the  operand. 
The '1' in the least significant position of A3 indicates that the accumulator
register INR1 would always be one of the source registers in the MS
instruction.   The  result  of  addition  always  goes  to  the  accumulator 
register  INR1. 
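
The one-bit-per-register use of field A3 can be sketched as follows. The helper `ms_source_registers` is hypothetical, and the mapping of bit n to register INR(n+1) is an assumption: the text fixes only that the least significant bit corresponds to the accumulator INR1. Rejecting a word whose least significant bit is 0 is likewise an illustrative choice.

```python
def ms_source_registers(a3: int, n_regs: int = 5):
    """List the source registers selected by the unencoded A3 field.

    Assumption: bit n of A3 selects register INR(n+1); only the
    bit 0 -> INR1 correspondence is given in the text.
    """
    if not a3 & 1:
        # The text says INR1 is always a source, so bit 0 should be 1.
        raise ValueError("A3 bit 0 must be 1 (accumulator INR1 is always a source)")
    return [f"INR{bit + 1}" for bit in range(n_regs) if (a3 >> bit) & 1]


print(ms_source_registers(0b00101))  # ['INR1', 'INR3']
```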

The D3 field in microinstruction FMA identifies the multiplier
digit for the formation of the partial product.

Microinstruction bit 4 carries the sign of the operand, S, which
is simply the sign of the most significant nonzero digit in the
accumulator. This sign is first determined by the MCU by issuing a sequence
of left shift microinstructions and testing the status indicators Z and S
of the most significant PE. The proper value of S, that is, bit 4, is
set by the MCU before issuing the microinstruction.

4.3.2.3  Description  of  subcontrols  by  control  sequence  charts  - 
The  subcontrols  are  multi-output  finite  state  machines  which  pro- 
duce control  signals  in  proper  temporal  order  for  the  execution  of 
various  microoperations  during  the  processing  of  a  microinstruction. 
These  control  signals  condition  the  combinational  processing  logic  to 
perform  elementary  microoperations  like  opening  or  closing  of  a  register 
gate,  setting  of  selector  networks  to  certain  states  or  the  setting  of 
a  control  status  memory  element.   In  addition,  some  of  the  control 
signals  act  as  interface  request-response  signals  for  the  coordination 
of  various  PEs  or  to  access  the  local  memory  (PEM)  module. 

The operation of the finite state machine can be described by a
control sequence chart (CSC), which is a flowchart-like description of a
control sequence. A control sequence is an instance of the execution
of a subcontrol. The control sequence chart shows the various control
signals generated during the execution of the subcontrol and their
temporal order.

4.3.2.3.1  Control sequence chart conventions - A control sequence
chart  (CSC)  consists  of  a  set  of  rectangular,  diamond  and  pentagonal 
shaped  boxes  and  entry  and  exit  symbols  connected  together  in  a  two- 
dimensional  pattern  with  straight  directed  lines.   The  arrows  on  the 
lines  indicate  the  direction  of  the  control  flow  in  the  sequence.  The 
various  symbols  used  in  the  CSC  are  shown  in  Figure  4.21. 

The  diamond  shaped  symbol  (Figure  4.21a  and  4.21b)  represents  the 
decision  element  with  single  entry  and  two  exit  points.   The  exit  points 
are  labeled  yes/no  (Figure  4.21a)  which  indicate  the  truth/falsehood  of 
the  statement  written  inside  the  box,  or  the  exit  points  are  labeled 
with  the  actual  name  of  the  option  (Figure  4.21b)  that  is  valid  on  that 
exit  point. 

The  rectangular  box  of  Figure  4.21c  represents  a  control  step.   A 
control  step  is  a  set  of  microoperations  (indicated  by  control  signals) 
enclosed  in  the  rectangular  box.   The  time  ordering  of  the  micro- 
operations  within  a  control  step  is  not  important  and  they  are,  in 
general, all activated in parallel. The rectangular boxes of Figures
4.21d and 4.21e represent the invoking of another subcontrol whose
name  is  written  inside  the  box.   However,  in  the  case  of  Figure  4.21d, 
the  exit  from  the  subcontrol  returns  the  control  flow  to  the  point  where 
it  was  invoked  (like  a  subroutine  call  in  software)  whereas  the  control 
flow  at  the  end  of  execution  of  the  subcontrol  indicated  in  Figure  4.21e 
branches  to  the  next  point  in  the  control  sequence  chart. 

The pentagonal boxes of Figures 4.21f and 4.21g respectively
represent the 'FORK' and 'JOIN' symbols. The 'FORK' symbol indicates that
the subcontrols at the exit points of the symbol are to be activated
concurrently. On the other hand, the 'JOIN' symbol signifies that the replies
from all the concurrently active control sequences indicated by the entry
points to the box must be true before the control flow can proceed any
further.

Figure 4.21  Control Sequence Chart Symbols.

The  entry  to  a  control  sequence  chart  is  indicated  by  a  single 
circle  (Figure  4.21h)  with  the  name  of  the  corresponding  subcontrol 
written  in  the  circle.   The  oval  symbol  (Figure  4.21j)  represents  a 
'return'  to  the  invoking  point  of  the  subcontrol  in  the  control  flow. 
A  double  circle  (Figure  4.21k)  represents  a  branch  to  the  entry  point 
of  the  subcontrol  whose  name  is  written  inside  the  circle. 

The  control  sequence  charts  which  are  too  big  to  be  fitted  on  a 
single  page  have  been  drawn  on  different  pages  but  the  entry  point  on 
each  page  is  labeled  the  same.   An  example  is  the  DM-control. 

The microoperations within a control step box are indicated by
either control signals of the form

    control signal name := 1 or 0

or transfer statements of the form

    x ← y
    x ← 1 or 0.

Most of the control signals in the DPL are level signals whereas the
interface request-response signals are pulse signals whose leading and
trailing edges are used to indicate request, acknowledge and response states.
The '1' or '0' on the right hand side indicates the logically 'active'
and 'inactive' state respectively. In the case of transfer statements
indicated by the arrow ←, x represents a control status memory element
which is set to the state '1' or '0' or to the state of 'y'.

The control signals for the selector networks are of the form XsY
where X indicates the input source to the selector network sY. The gate
signal for a register is of the form gRegisterName where RegisterName
identifies the register which has to be loaded with information.

Square  brackets  [  ]  indicate  a  subscript  value  as  in  ISP  notation 
[42]  and  thus  the  address  of  a  register  or  memory  location  when  these 
brackets  appear  after  a  memory  element  name.  The  value  of  the  subscript 
is  written  within  the  square  brackets. 

The angle brackets < > enclose lists of bit names. For example,
if MIR is a register, then MIR<4:0> indicates bits 0 through 4 of
register MIR, and the bits in MIR are numbered from right to left in
ascending order.
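
The angle-bracket convention corresponds to ordinary bit-field extraction, as the sketch below shows. The helper `bits` is a hypothetical name, and the example word is arbitrary. Placing the three-bit basic OP-code in MIR<10:8> is an assumption; the text itself refers only to MIR<7:5> (register address) and MIR<4:0> (digit).

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return value<hi:lo>, with bit 0 as the rightmost bit."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


mir = 0b101_011_10010          # arbitrary 11-bit example word
print(bits(mir, 10, 8))        # 5  (assumed OP-code position)
print(bits(mir, 7, 5))         # 3  (register address field)
print(bits(mir, 4, 0))         # 18 (digit field)
```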

The subscripts i, i-1 and i+1 on the signal names indicate the index
of the PE originating the interface control signal.

4.3.2.3.2  Description of R-control - The function of the R-control
in $PE_i$ is to accept for processing and to acknowledge a microinstruction
from the adjacent $PE_{i-1}$, and to invoke the DM-control for the
processing of the microinstruction. The control sequence chart for the
R-control is shown in Figure 4.22.

The R-control indicates its readiness to $PE_{i-1}$ to accept another
microinstruction by the signal $RACK_i$ := 1. The R-control in $PE_i$
monitors the request signal $TRQ_{i-1}$ from $PE_{i-1}$. The active state of
$TRQ_{i-1}$ indicates that the information on input port $MIP_i$ is valid,
and R-control (control step RC1) loads the microinstruction into register
MIR<10:0>. (It is assumed that the selector sMIR<4:0> was put earlier in a
state to select the MIP input.) Then the R-control (control step RC2)
acknowledges the receipt of the microinstruction by the control signal
$RACK_i$ := 0. The R-control (control step RC3) then invokes the DM-control
for the processing of the microinstruction, and waits for a reply from the
DM-control. The reply indicates that the processing is finished and that
R-control can accept another microinstruction, which it indicates to
$PE_{i-1}$ by the control signal $RACK_i$ := 1. At the same time,
the selector network sMIR<4:0> is set to select the data from the
microinstruction input port $MIP_i$. This is done in control step RC4.

It is assumed, in the control sequence chart of Figure 4.22,
that initially, at power turn-on, $RACK_i$ := 1 and MIPsMIR<4:0> := 1 are
true.
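
The four control steps RC1 to RC4 can be followed in a small behavioral sketch. It models only the order of events described above: the pulse edges of the real request-response signals are not modeled, and the class `RControl`, its attribute names and the callable standing in for the DM-control are all illustrative assumptions.

```python
class RControl:
    """Behavioral sketch of the R-control sequence for one PE."""

    def __init__(self, dm_control):
        self.dm_control = dm_control  # callable standing in for the DM-control
        self.rack = 1                 # power-on assumption: RACK := 1
        self.mip_selected = 1         # power-on assumption: MIPsMIR<4:0> := 1
        self.mir = 0
        self.trace = []

    def on_request(self, mip_word: int) -> None:
        """Handle an active TRQ from the left neighbor with a valid MIP word."""
        if self.rack != 1:
            raise RuntimeError("R-control has not signalled readiness")
        self.mir = mip_word & 0x7FF   # RC1: load MIR<10:0>
        self.trace.append("RC1")
        self.rack = 0                 # RC2: acknowledge receipt
        self.trace.append("RC2")
        self.trace.append("RC3")      # RC3: invoke DM-control and wait
        self.dm_control(self.mir)
        self.rack = 1                 # RC4: ready for the next microinstruction
        self.mip_selected = 1         #      and sMIR<4:0> selects MIP again
        self.trace.append("RC4")


seen = []
r = RControl(lambda word: seen.append((word, r.rack)))
r.on_request(0x2A5)
print(r.trace, seen, r.rack)
```

While the stand-in DM-control runs, RACK is 0, so a second request would be refused until RC4 restores readiness.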

4.3.2.3.3  Description of DM-control - The DM-control can be looked
upon as the main control which, on being invoked by the R-control, monitors
the output of the microinstruction decoder. Depending on the nature of
the microinstruction, it sets up the necessary data paths and conditions
the combinational logic in the data flow logic of the PE. After the data
paths are set up, the DM-control invokes one or more of the other
controls F, E and G for processing and, if necessary, the T-control for
onward transmission of the microinstruction to $PE_{i+1}$. Since the
selectors have no memory, and the data paths remain set throughout the
processing, the output of the microinstruction decoder can be directly
connected to the selector signals of the form XsY := 1, which involves no
extra logic cost. Figures 4.23a, b, and c show the control sequence chart
for the DM-control.

The data path for the microinstruction SS is through the selector
sADR for operand register INR2, through the selector sDSE and encoder
DSE for the result digit (the sum of the contents of INR1 and INR2), and
finally through the selector sRIB. The encoder DSE converts the
redundant binary sum digit to the sign-magnitude format SM. This data
path is set up by control step $DMC1_1$. The control signal TAsTOP sets
up the data path for the 'Adder Transfer' out of $PE_i$. The transfer
microoperation that sets the control element P conditions the DSE encoder
logic for proper conversion of the redundant binary result digit into the
equivalent sign-magnitude format.

For the microinstruction MS, the control step $DMC1_2$ sets up the
selector sADR for the source operands and the data paths for the
result digit through the selectors sDSE and sRIB and for the 'Adder
Transfer' through the selector sTOP.

If the microinstruction to be processed is FMA, the control step
$DMC1_3$ sets up the necessary data paths: for the 'product array' and
'product transfer' through sADR, for the result digit through sDSE and
sRIB, and for the 'Adder Transfer' through sTOP. The multiplicand digit from
operand register INR2 is put on the register output port $ROP_i$ via
selector sROB. The control memory flip-flop GFMA acts as a
synchronizing device between the concurrently active and interacting controls
$G_{gn}$-control and $G_{ap}$-control. It is initialized to state '1'. The
details of its action are discussed later in Section 4.3.2.3.6.

Control steps $DMC1_4$ through $DMC1_9$ set up the data paths for the
microinstruction shown at the entry points of each control step in
Figures 4.23a, 4.23b, and 4.23c. For the left shift microinstruction
LS, the data path for the digit in $PE_i$ is from the internal register to
the output port $ROP_i$ via the selector sROB, and the data path for the
incoming digit from $PE_{i+1}$ is from $RIP_i$ to register IBR via the
register APR and selector sRIB. In the case of microinstruction RS, the
digit to be stored is in microinstruction register MIR<4:0> and its
data path to the input of the register file is via the selector sRIB. The
data path for the digit to be shifted out to $PE_{i+1}$ is via the selector
sROB, the bus ROB and the selector sMIR<4:0> to register MIR and thence
to port $MOP_i$. For the inter-register transfer microinstructions TD and
TI, the data path is through the selector sROB, the bus ROB, and the
selectors sDSE and sRIB. The control memory flip-flop SCHI generates
the similarly named control signal which transforms the SM-encoded output
of ROB into redundant binary format for proper transfer. The logical
'1' state of control signal SCHI guarantees that the magnitude bits of
the input digit on ROB will appear unchanged at the output of DSE. This
can be seen from Figure 4.14 and the discussion in Section 4.2.2.6.
The data path for the microinstruction LDC is from the register MIR
through the selector sRIB to the proper register INR[MIR<7:5>] of the
register file via the buffer register IBR.

For the memory access microinstructions, the communication of data
and address takes place via the ports $ROP_i$, $MIP_i$, and $TOP_i$. For the
microinstructions SPM and LPM, the data path for the address of the
location in PEM and the read/write bit is from MIR to TOP via the
selector sTOP. However, the data path for the data to be stored in the
case of SPM is from register INR[MIR<7:5>] through selector sROB to the
output port $ROP_i$, while the data path from memory to the register
INR[MIR<7:5>] for microinstruction LPM is via port $MIP_i$, selector
sMIR<4:0>, register MIR<4:0>, the selector sRIB and buffer register
IBR.

The data path for the microinstructions AR and NR is from the
register INR1 through the selectors sROB and sDSE, the encoder DSE and the
selector sRIB back to INR1 via buffer register IBR. Note that the
OP-codes for the microinstructions AR and NR are so chosen that bits
MIR<7:5> address the register INR1. This explains the OP-code choices
for the various microinstructions shown in Figure 4.20.

After the data paths are set up, the DM-control invokes one or
more of the G-, F-, E- and T-controls to actually process the
microinstruction. The microinstructions SS, MS, FMA and LS all require
G-information from their right neighboring PEs. So the DM-control
invokes the G-control (consisting of G_gn-control and G_ap-control)
and the T-control in parallel. The T-control transmits the present
microinstruction to PE_{i+1}. The identity of the microinstruction in
PE_i is essential in PE_{i+1} to generate the G-information for the
microinstruction processing. The control flow at the end of
G_ap-control branches directly to the E-control for the actual
calculation and storage of the result digit.

When all the concurrently invoked subcontrols are finished, they
report back to the DM-control at the invoking point in the control
flow. The DM-control then replies to the R-control which had earlier
invoked it. The R-control was in a state of active suspension (wait
state) during the activity of DM-control. The R-control now gets
ready to receive another command as explained earlier.

For the microinstruction RS, DM-control, after setting up the data
paths, invokes F-control, which changes the modifier field of the
microinstruction in MIR of PE_i for transmission to PE_{i+1}. The
details of F-control are discussed later. At the end of F-control,
E-control and T-control are invoked in parallel; no G-information is
required for the processing of microinstruction RS. On return from
both the concurrently active E- and T-controls, the DM-control replies
to the waiting R-control.

In  the  case  of  microinstructions  TD,  TI,  LDC,  LPM  and  SPM  (Figure 
4.23b)  no  G-information  is  required.   Hence  DM-control  invokes  only  the 
E-control  and  T-control  in  parallel.  The  rest  of  the  control  flow  is 
as  it  is  for  RS. 

The invoking of E-, G- and T-controls by DM-control for the
microinstructions AR and NR is more complex, since the invocation
depends on the nature of the data resident in the adjacent PE_{i+1}.
The digit algorithm of the microinstruction AR, discussed in Section
3.6.5, requires knowing the sign of the first non-zero digit to the
immediate right of the present digital position. This is done through
the use of interface control signals S_i and Z_i and control memory
flip-flops SAD, SAK and ADZ, which respectively stand for Sign of
Adjacent Digit, Sign of Adjacent Digit Known and Adjacent Digit is
Zero. A logical '1' for SAK or ADZ indicates assertion (truth) whereas
'0' indicates falsehood. The interface control signal S_i (which is
the output of control memory flip-flop S_i) indicates the sign of the
digit in PE_i's accumulator register. Z_i (which is also the output of
flip-flop Z_i) indicates whether the magnitude of the accumulator digit
is zero: Z_i = 1 indicates that the digit is zero and Z_i = 0 indicates
otherwise. Note that if Z_i is monitored by the adjacent PE_{i-1}, the
validity of Z_i can only be ensured when RACK_i = 1, i.e., when PE_i
is not in the middle of executing any previous microinstruction. The
mechanism for determining the sign of the first non-zero digit to the
right of the present digital position, i say, is as follows.

If the digit in PE_{i+1} is zero, G_ap-control in PE_i goes into a
wait loop. In the meantime, the microinstruction AR is passed to
PE_{i+1}, where again Z_{i+2} is monitored to see if the digit in
PE_{i+2} is zero. If it is, it (G_ap-control in PE_{i+1}) also goes
into a wait loop and the microinstruction passes to PE_{i+2},
PE_{i+3}, ..., PE_{i+j} if Z_{i+j+1} = 0 and Z_{i+3}, Z_{i+4}, ...,
Z_{i+j} are all in logical state '1'. The G_ap-controls in PE_{i+2},
..., PE_{i+j-1} go into the wait loop. As soon as Z_{i+j+1} = 0 is
monitored by PE_{i+j}, G_gn-control in PE_{i+j} assigns the value of
S_{i+j+1} to S_{i+j} and declares the sign valid to the waiting
G_ap-control in PE_{i+j-1}, by assigning logical state '1' to the
control signal GRQ_{i+j}. The G_ap-control in PE_{i+j-1} informs the
G_gn-control in PE_{i+j-1} about the validity of sign S_{i+j} by
setting the synchronizing control flip-flop SAK, in PE_{i+j-1}, to
logical state '1'. The G_gn-control in its turn assigns the value of
S_{i+j} to S_{i+j-1} and declares the sign S_{i+j-1} to be valid. The
sign of the digit thus flows backward till PE_i is reached and in this
way, all the zero digits lying to the immediate right of digital
position i are assigned the sign of the first non-zero digit.
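
The net effect of this backward flow of the sign can be illustrated
with a short sequential sketch. It is only a functional model under
stated assumptions: the hardware reaches the same result through
concurrent wait loops in neighboring PEs, whereas the sketch scans the
digits from right to left. The function name, the encoding of the sign
as 0 for '+' and 1 for '-', and the default sign used when no non-zero
digit lies to the right are all assumptions made for illustration.

```python
def propagate_signs(digits):
    """Return the sign attached to each digit position.

    digits: signed digits, most significant first, so index pos + 1 is
    the right neighbor of index pos. Signs are encoded 0 for '+' and
    1 for '-' (an assumed encoding). A zero digit takes the sign of the
    first non-zero digit to its right.
    """
    signs = [0] * len(digits)
    sign_to_right = 0  # assumed default when nothing non-zero lies to the right
    for pos in range(len(digits) - 1, -1, -1):
        if digits[pos] != 0:
            # A non-zero digit carries its own sign.
            sign_to_right = 1 if digits[pos] < 0 else 0
        # A zero digit inherits the sign found so far to its right.
        signs[pos] = sign_to_right
    return signs


print(propagate_signs([1, 0, 0, -1, 0, 1]))
```

In the example call, the two zero digits to the left of the digit -1
inherit its negative sign, and the zero between -1 and the final 1
inherits the positive sign.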

We now describe the action of DM-control for microinstructions AR
and NR. The DM-control checks the state of control signal Z_{i+1} by
monitoring the control signal RACK_{i+1} as explained earlier. If the
adjacent digit in PE_{i+1} is not zero (Z_{i+1} ≠ 1), the control
memory element SAD is set to the state of S_{i+1}, the sign of the
adjacent digit is declared known by SAK ← 1, and the adjacent digit is
declared non-zero by setting ADZ to logical state '0'. However, if
Z_{i+1} = 1, the control memory flip-flops SAD and SAK are set to
logical state '0' and flip-flop ADZ to state '1'.
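
The flag settings just described amount to a two-way case split, which
the following minimal sketch restates. The function name and the
argument names are hypothetical; only the flag names SAD, SAK and ADZ
and their values come from the text.

```python
def dm_adjacent_flags(z_next, s_next):
    """Return (SAD, SAK, ADZ) as set by DM-control for AR and NR.

    z_next: state of Z in the right neighbor (1 means its digit is zero).
    s_next: state of S in the right neighbor (sign of its digit).
    """
    if z_next != 1:
        # Adjacent digit is non-zero: its sign is known immediately.
        return (s_next, 1, 0)
    # Adjacent digit is zero: sign not yet known, ADZ asserted.
    return (0, 0, 1)


print(dm_adjacent_flags(0, 1))
print(dm_adjacent_flags(1, 1))
```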

For the microinstruction AR, the DM-control then invokes T-control
and G-control in parallel irrespective of the state of the control
signal Z_{i+1}. However, for microinstruction NR, no digit beyond and
including the first (counting from the left) zero digit of the operand
needs to be recoded. So the flow of microinstruction NR stops as soon
as Z_{i+1} = 1 is encountered. This is done by the DM-control not
invoking the T-control in PE_i. However, G_gn-control and G_ap-control
are still invoked for uniformity of the invoking procedure, although
G_ap-control is immediately exited for microinstruction NR, as can be
seen from the control sequence chart for G_ap-control in Figure 4.27.
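
The invocation rules of DM-control described so far can be collected
into one lookup, sketched below as an illustration rather than as the
thesis's own notation. Each returned list entry is a set of subcontrols
invoked in parallel, and successive entries are invoked in sequence.
E-control is listed only where DM-control invokes it directly; where
G-control is invoked, E-control is reached from the end of the
accepting part of G-control. The function and constant names are
hypothetical.

```python
G_GROUP = {"SS", "MS", "FMA", "LS"}                 # need G-information
DIRECT_GROUP = {"TD", "TI", "LDC", "LPM", "SPM"}    # no G-information


def dm_invocations(op, z_next=0):
    """Return the stages of subcontrols that DM-control invokes for op.

    z_next is the Z signal of the right neighbor (1 means zero digit);
    it only matters for NR.
    """
    if op in G_GROUP or op == "AR":
        return [{"G", "T"}]
    if op == "NR":
        # NR is not passed on once a zero adjacent digit is met.
        return [{"G"}] if z_next == 1 else [{"G", "T"}]
    if op == "RS":
        return [{"F"}, {"E", "T"}]
    if op in DIRECT_GROUP:
        return [{"E", "T"}]
    raise ValueError(f"unknown microinstruction: {op}")


print(dm_invocations("RS"))
print(dm_invocations("NR", z_next=1))
```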

When all the concurrently invoked controls have finished, the
DM-control replies to the waiting R-control, which gets conditioned
to receive another microinstruction for processing in PE_i.

4.3.2.3.4 Description of T-control - The T-control, when invoked
by DM-control, passes the microinstruction in register MIR of PE_i to
PE_{i+1}. The control sequence chart for T-control is shown in Figure
4.24. The T-control in PE_i monitors the signal RACK_{i+1} (from the
R-control in PE_{i+1}), whose logical state '1' indicates that the
R-control in PE_{i+1} is ready to accept the microinstruction. The
control step TC1 sets the control memory flip-flop whose output gates
the contents of MIR onto bus MOP_i. Then in control step TC2, the
request signal TRQ_i is activated, which in turn is monitored by the
R-control in PE_{i+1}. As soon as the R-control in PE_{i+1} accepts the
microinstruction from MOP_i (= MIP_{i+1}), it acknowledges by assigning
the '0' logical state to the acknowledge signal RACK_{i+1}. The '0'
state of RACK_{i+1}, being monitored by T-control, signifies that the
microinstruction has been accepted; control step TC3 then withdraws
the request for transmission by assigning the '0' logical state to the
request signal TRQ_i and also removes the information from the bus
MOP_i (MIP_{i+1}). The latter is necessary for microinstruction LPM,
where the port MIP_{i+1} is used for inputting data read from
PEM_{i+1}.

At the end of the control sequence, the control flow returns to
the point where the T-control was invoked.
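
The request/acknowledge exchange between T-control and the receiving
R-control can be sketched as follows. This is a single-threaded
illustration under assumptions: the Receiver class is a hypothetical
stand-in for the R-control of the next PE, and the points where the
hardware would wait are modeled as simple checks that raise an error.

```python
class Receiver:
    """Hypothetical stand-in for the R-control of the next PE."""

    def __init__(self):
        self.rack = 1    # RACK = 1: ready to accept a microinstruction
        self.mir = None  # receiving microinstruction register

    def poll(self, trq, bus):
        # Accept the word when a request is pending and we are ready,
        # then acknowledge by dropping RACK to 0.
        if trq == 1 and self.rack == 1:
            self.mir = bus
            self.rack = 0


def t_control(mir_word, receiver):
    """Run control steps TC1 to TC3 once and return the step trace."""
    trace = []
    if receiver.rack != 1:
        raise RuntimeError("receiver not ready")  # hardware would wait here
    bus = mir_word           # TC1: gate MIR onto the output bus
    trace.append("TC1")
    trq = 1                  # TC2: raise the request signal TRQ
    trace.append("TC2")
    receiver.poll(trq, bus)
    if receiver.rack != 0:
        raise RuntimeError("no acknowledge")      # hardware would wait here
    trq, bus = 0, None       # TC3: withdraw request and clear the bus
    trace.append("TC3")
    return trace


receiver = Receiver()
print(t_control(0b10100011, receiver), receiver.mir)
```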


Figure 4.24 Control Sequence Chart for T-control. (Chart steps: wait
while RACK_{i+1} is not 1; TC1: MIRgMOP := 1; TC2: TRQ_i := 1;
TC3: TRQ_i := 0, MIRgMOP := 0; RETURN.)

4.3.2.3.5 Description of F-control - The function of the F-control
is to modify the microinstruction modifier field F before transmission
of the microinstruction to the next PE, i.e., PE_{i+1}. This is made
use of in the microinstruction RS, where the modifier field carries the
digit to be shifted into the adjacent PE. Figure 4.25 shows the control
sequence chart for F-control. Control step FC1 loads the buffer
register IBR from the output of the selector sRIB. The selector sRIB
was initially conditioned by a DMC1 control step in DM-control to
select the digit from MIR<4:0>. At this time, MIR<4:0> carries the
digit from the adjacent PE_{i-1}, which arrived as part of the
microinstruction from PE_{i-1}. Control step FC2 loads MIR<4:0> from
the output of the selector sMIR<4:0>, which was conditioned by a DMC1
control step to accept the digit from INR[MIR<7:5>] in PE_i that is to
be shifted into the next PE_{i+1}. At the end of control step FC2, the
control flow branches to initiate E-control and T-control in parallel.
The T-control transmits the microinstruction in MIR with the new
modifier field, and the E-control loads the register INR[MIR<7:5>]
from the buffer register IBR.
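
The exchange performed by FC1, FC2 and the following E-control step
behaves like a one-digit shift along the chain of PEs. The sketch below
models it sequentially under assumptions: the function name is
hypothetical, the list stands for the addressed register in successive
PEs, and the digit entering the first PE is assumed to be 0 unless
given.

```python
def shift_right(register_digits, incoming=0):
    """Model the microinstruction RS across a chain of PEs.

    register_digits[i] is the digit held in the addressed register of
    the i-th PE. incoming is the digit carried by the modifier field
    when the microinstruction reaches the first PE (assumed 0).
    Returns (new register contents, digit leaving the last PE).
    """
    modifier = incoming
    result = []
    for digit in register_digits:
        ibr = modifier      # FC1: IBR <- modifier field (digit from the left)
        modifier = digit    # FC2: modifier field <- this PE's register digit
        result.append(ibr)  # E-control: register <- IBR
    return result, modifier


print(shift_right([1, -1, 0, 1]))
```

Each PE keeps the digit that arrived in the modifier field and forwards
its own previous digit, so the whole operand moves one position to the
right.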

4.3.2.3.6 Description of G-control - The G-control consists of two
independent subcontrols: G_gn-control, which generates the
G-information (mainly the 'Adder Transfer' t^A_{i-1}) for the adjacent
PE_{i-1}; and G_ap-control, which accepts the G-information from the
adjacent PE_{i+1}. The G-control is invoked by DM-control only when
the processing of the microinstruction requires information from the
adjacent PEs. When the G-information depends logically on more than
one adjacent PE, the G_gn-control and G_ap-control interact with each
other through the synchronizing control memory flip-flops GFMA (in the
case of microinstruction FMA) and SAK (for the microinstructions AR
and NR).

Figure 4.25 Control Sequence Chart for F-control. (Chart steps:
FC1: gIBR := 1; FC2: gMIR<4:0> := 1.)
4.3.2.3.6.1 Description of G_gn-control - The function of
G_gn-control is to generate the G-information needed in the adjacent
PE_{i-1}. Figure 4.26 shows the control sequence chart for this
control. The G-information for the microinstructions SS and MS consists
of the 'Adder Transfer' t^A_{i-1}, which is routed to the output port
TOP_i by DMC1 control steps. The G-information for the microinstruction
LS is the digit in register INR[MIR<7:5>], which is routed to port
ROP_i by the data path set up in a DMC1 control step. After the
G-information stabilizes on ports TOP_i and ROP_i, control step Ggn1
informs the G_ap-control in PE_{i-1} about the validity of the
G-information. The G_gn-control then monitors the acknowledgment
signal GACK_{i-1} from the G_ap-control of PE_{i-1}. When GACK_{i-1} is
in logical state '1', the control step Ggn2 declares the G-information
not valid.

For the microinstruction FMA, the G-information {}_{FMA}G_i to be
generated by PE_i for PE_{i-1} consists of the 'Adder Transfer'
t^A_{i-1} and the multiplicand digit φ_i (assuming 'local generation'
of the CPT t^P_i). The 'Adder Transfer' t^A_{i-1} is dependent on the
multiplicand digit φ_i and accumulator digit a_i in registers INR2 and
INR1 respectively in PE_i, the multiplicand digit φ_{i+1} and
accumulator digit a_{i+1} in registers INR2 and INR1 respectively in
PE_{i+1}, and the multiplier digit m_j. t^A_{i-1} consists of two
parts, t^{A0}_{i-1} and t^{A1}_{i-1}, and is generated in a time
sequential manner. The process of generation of {}_{FMA}G_i can be
represented in the notation of Section 2.4 as follows:


Figure 4.26 Control Sequence Chart for G_gn-control. (Chart content:
branch SS v MS v LS waits for the G-information to stabilize on port
TOP or ROP, then step Ggn1; branch FMA waits for the G-information to
stabilize on ports TOP and ROP, then step Ggn3, tests GFMA = 1, waits
for the G-information to stabilize on port TOP, then step Ggn5; branch
AR v NR performs Ggn7: S := SAD and Ggn8: GV := 1; the branches end in
RETURN.)

    {}_{FMA}G_i = \{ t^{A}_{i-1}, \phi_i \}

    t^{A}_{i-1} = r(m_j, \phi_i, a_i, \phi_{i+1}, a_{i+1})
                = \{ t^{A0}_{i-1}, t^{A1}_{i-1} \}

where

    t^{A0}_{i-1} = r^{0}(m_j, \phi_i, a_i)

    t^{A1}_{i-1} = r^{1}(m_j, \phi_i, a_i, \phi_{i+1}, t^{A0}_{i})

and

    {}_{FMA}G_i = \{ t^{A0}_{i-1}, t^{A1}_{i-1}, \phi_i \}
                = \{ G^{0}_i, G^{1}_i \}

where

    G^{0}_i = \{ t^{A0}_{i-1}, \phi_i \}

and

    G^{1}_i = \{ t^{A1}_{i-1} \}


In the above relations, the subscript FMA has been dropped from all
variables for ease of reading. The above described structure for
G^0_i and G^1_i can be deduced from an examination of Figures 3.13 and
4.11d together.

When the G-information G^0_i is valid on ports TOP_i and ROP_i, which
carry the t^{A0}_{i-1} and φ_i components of G^0_i respectively,
control step Ggn3 informs PE_{i-1} about the validity of the G^0_i
information. After PE_{i-1} has accepted G^0_i (indicated by
GACK_{i-1} = 1), control step Ggn4 sets the validity signal GV_i to
logical state '0'. To generate G^1_i, it is necessary that G^0_{i+1}
(= {t^{A0}_i, φ_{i+1}}) be available in PE_i. When G^0_{i+1} from
PE_{i+1} is valid on input ports TIP_i and RIP_i, the G_ap-control in
PE_i accepts and stores G^0_{i+1} and informs the G_gn-control about
its availability by setting the control memory flip-flop GFMA to
logical state '1'. As soon as the synchronizing flip-flop GFMA is in
logical state '1' and G^1_i is valid on port TOP_i (G^1_i =
t^{A1}_{i-1} is automatically generated by the logic in MIAD of PE_i),
the control steps Ggn5 and Ggn6 declare the G^1_i information valid
and invalid respectively, in the same way as do control steps Ggn3 and
Ggn4.

For the microinstructions AR and NR, the G-information consists
only of the sign of the digit in the accumulator of the adjacent PE
and of whether that digit is zero or not. If the digit in the
accumulator in PE_i is non-zero (Z_i ≠ 1), then no G-information needs
to be generated because it is already known to PE_{i-1} via its
DM-control. However, if the present digit is a zero (Z_i = 1), the
meaningful sign for this zero digit is the sign of the first non-zero
digit to its right. If the digit to the immediate right in PE_{i+1} is
non-zero, then the sign of the adjacent digit is known and was stored
earlier in flip-flop SAD by the DM-control in PE_i. If, however, the
digit in PE_{i+1} is zero (Z_{i+1} = 1), the sign of the adjacent digit
is unknown (SAK ≠ 1). The G_gn-control goes into a wait loop,
continuously monitoring SAK, till the G_ap-control in PE_i determines
the sign of the digit in PE_{i+1} and stores it in SAD. As soon as the
sign of the adjacent digit is known, control step Ggn7 assigns the same
value to the flip-flop S_i, whose value represents the sign of the
accumulator digit in PE_i. Control step Ggn8 informs the G_ap-control
in PE_{i-1} about the validity of S_i. After the G_ap-control in
PE_{i-1} acknowledges the receipt of the valid sign S_i (GACK_{i-1} =
1), control step Ggn9 withdraws the validity signal. G_gn-control now
returns to the invoking point in DM-control.
4.3.2.3.6.2 Description of G_ap-control - The function of
G_ap-control is to accept the G-information generated by G_gn-control
in PE_{i+1}. Figure 4.27 shows the control sequence chart for
G_ap-control. This G-information is available on port TIP_i (for
microinstructions SS, MS and FMA), on port RIP_i (for microinstructions
FMA and LS) and on interface control line S_{i+1} (in the case of
microinstructions AR and NR).

Figure 4.27 Control Sequence Chart for G_ap-control.

In the case of microinstructions SS and MS, the G_ap-control
monitors the validity interface signal GV_{i+1}. As soon as the
G-information is valid, control step Gap1 stores the G-information on
bus TIP_i into the G-information register GIR. Control step Gap2
acknowledges the receipt of the G-information, and control step Gap3
withdraws the acknowledgment signal GACK_i once the validity signal is
withdrawn (GV_{i+1} = 0) by the G_gn-control in PE_{i+1}. For
microinstruction LS, the same sequence is followed except that the
G-information is available on RIP_i and is stored in register APR by
control step Gap4.

As explained earlier, the G-information G_{i+1} for microinstruction
FMA consists of two components: G^0_{i+1} (= {t^{A0}_i, φ_{i+1}}) and
G^1_{i+1} (= {t^{A1}_i}). When the G^0_{i+1} information is valid, the
control step Gap7 stores the t^{A0}_i component in register GIR and the
φ_{i+1} component in register APR. Then control step Gap8 sets the
synchronizing flip-flop GFMA to logical state '1' to inform the
G_gn-control about the availability of the G^0_{i+1} information, so
that G_gn-control may generate G^1_i for PE_{i-1}. Control steps Gap9
and Gap10 play the same role of acknowledgment assertion and its
withdrawal as control steps Gap2 and Gap3. After control step Gap10,
G_ap-control again starts monitoring the validity control signal, now
for G^1_{i+1}. As soon as G^1_{i+1} is valid (GV_{i+1} = 1) on TIP_i,
control step Gap11 stores it in the G-information buffer register GIR,
and then control steps Gap12 and Gap13 respectively acknowledge the
receipt of the G^1_{i+1} information and withdraw the acknowledge
signal on response from G_gn-control.

For the microinstructions AR and NR, if the adjacent digit is
non-zero, or if it is zero and the microinstruction is NR, no
G-information needs to be accepted from the adjacent G_gn-control and
the G_ap-control is immediately exited.

In the case of a zero adjacent digit and microinstruction AR,
G_ap-control monitors the G-information validity signal. Here the
G-information consists of S_{i+1}. As soon as the adjacent G_gn-control
has determined the valid sign for S_{i+1} (which is the sign of the
first non-zero digit to its right), it sets the validity signal
GV_{i+1} to logical state '1'. As soon as the G_ap-control finds
GV_{i+1} in state '1', the control step Gap14 stores the sign S_{i+1}
in flip-flop SAD and Gap15 sets the synchronizing flip-flop SAK to
logical state '1'. (SAK is being monitored by G_gn-control in PE_i in
order to attach this sign, stored in SAD, to S_i.) Control steps Gap16
and Gap17 play the same role as Gap2 and Gap3.

At the end of execution of G_ap-control, the control sequence for
the processing of the microinstruction branches directly to E-control,
where the result digit is calculated and stored in the appropriate
register of the register file.
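
The validity/acknowledge exchange between a generating G-control and an
accepting G-control, as described for the microinstructions SS and MS,
can be traced with the sketch below. It is a sequential illustration
under assumptions: both controls run concurrently in hardware, whereas
here their steps are interleaved in the one order the handshake allows.
The function name and the trace format are hypothetical; the step names
and the signals GV and GACK follow the text.

```python
def g_handshake(value):
    """Trace one G-information transfer for SS or MS.

    Returns (GIR contents, trace), where each trace entry is
    (step name, GV, GACK) after that step has executed.
    """
    gv = 0       # validity signal driven by the generating control
    gack = 0     # acknowledge signal driven by the accepting control
    gir = None   # G-information register of the accepting PE
    port = value # G-information stable on the port
    trace = []

    gv = 1                           # Ggn1: declare G-information valid
    trace.append(("Ggn1", gv, gack))
    if gv == 1:
        gir = port                   # Gap1: store G-information in GIR
        trace.append(("Gap1", gv, gack))
        gack = 1                     # Gap2: acknowledge receipt
        trace.append(("Gap2", gv, gack))
    if gack == 1:
        gv = 0                       # Ggn2: declare G-information not valid
        trace.append(("Ggn2", gv, gack))
    if gv == 0:
        gack = 0                     # Gap3: withdraw the acknowledgment
        trace.append(("Gap3", gv, gack))
    return gir, trace


gir, trace = g_handshake(5)
print(gir, [step for step, _, _ in trace])
```

Both signals end in state 0, so the pair is ready for the next
transfer.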


4.3.2.3.7 Description of E-control - Figure 4.28 shows the control
sequence chart for E-control. For the microinstructions SS, MS, FMA,
LDC and LS, the E-control loads the result digit, which is available
at the output of selector sRIB, into buffer register IBR. This is done
in control step E3. Then control steps E4_1 and E4_2 transfer the
contents of IBR into accumulator register INR1 for microinstructions
SS, MS and FMA, and into the destination register INR[MIR<7:5>] for LDC
and LS. Finally, the control step E5 sets the status indicators S_i
and Z_i.

For the microinstruction RS, the control step E3 is bypassed and
control step E4_2 loads the register to be shifted. E5 sets the digit
status indicators.

For the inter-register transfer microinstructions, the state of
the sign bit of the digit on the bus ROB is transferred to the sign
bit output of the digit sum encoder DSE for TD. The complement of
that state is transferred in the case of microinstruction TI. This is
done in control steps E1_1 and E1_2 respectively. The control sequence
then goes through the control steps E3, E4_2 and E5. Control step E4_2
loads the destination register in the register file.

For the microinstruction LPM, control step E6_1 requests access to
the local memory PEM_i of PE_i. Note that the address of the location
in PEM_i and the Read/Write bit (in the state 'Read') are already
available on the output port TOP_i. The PEM_i reads out the data onto
the microinstruction input bus MIP_i and informs PE_i by the logical
state '1' of the acknowledge signal MACK_i. The control step E6_2
loads the register MIR<4:0> from the output of selector sMIR<4:0>,
which had been conditioned earlier, in DM-control, to accept this
data, and also withdraws the request for memory access. The control
steps E3 and E4_2 load the buffer register IBR and the file register
INR[MIR<7:5>]. Finally the control sequence goes through control step
E5 for setting the status indicators.

For the store microinstruction SPM, the address of the PEM
location is already available on output bus TOP_i and the digit to be
stored is on bus ROP_i when the control flow enters E-control. Control
step E7_1 requests access to the memory. The PEM then responds by the
logical state '1' of the acknowledge signal MACK_i, after accepting
the data and address from the buses TOP_i and ROP_i. Control step E7_2
then withdraws the request for memory access. The control sequence
finally goes through the status setting control step E5.

For the microinstructions NR and AR, the E-control implements the
digit algorithms discussed earlier in Sections 3.6.4 and 3.6.5. Two
E2 control steps respectively achieve the radix complement and the
diminished radix complement of the magnitude bits of the accumulator
digit. A third E2 control step diminishes the magnitude of the
accumulator digit by unity. The particular setting of control signals
to the states shown in these three control steps was explained earlier
in Section 4.2.2.6. A further E2 control step assigns the state of
MIR<4>, which is the sign of the whole operand to be assimilated or
normalized, to the sign bit output of the digit sum encoder DSE.
Control steps E3 and E4_1 load the result digit in buffer IBR and
accumulator register INR1. Finally the control step E5 sets the status
indicators regarding the sign and magnitude of the accumulator digit
in PE_i.

When the control sequence corresponding to E-control is finished,
the control flow returns to the point in DM-control where G-control
was invoked. This is because the control flow had branched into
E-control at the end of the G_ap-control sequence.


4.4  Logic  Complexity  of  Processing  Element 

From the viewpoint of LSI implementation of a PE, two things are of
major importance: the number of circuit elements and the number of
external pins required for the chip. The total number of circuit
elements and pins determines the silicon real estate, the density of
the circuit elements, the heat dissipation, etc. The number of circuit
elements depends on the technology used for the implementation of the
logic on the chip. In this thesis, we shall use the number of gates as
an indirect measure of logic complexity because the number of circuit
elements is directly related to the number of gates. Further, a
multi-input NAND gate is considered equivalent to a 2-input NAND gate
because in TTL logic a multi-input NAND is realized by the use of a
multi-emitter transistor. These assumptions have been made for the
sake of simplicity.

The  overall  gate  complexity  and  pin  complexity  of  a  PE  must  take 
into  account  the  gates  and  pins  required  by  a  PE's  major  components: 
DPL,  PE  control  logic  and  Register  File. 

4.4.1  Logic  complexity  of  DPL 

4.4.1.1 Gate complexity of digit processing logic DPL - The total
number of gates required for the DPL is equal to the sum of the gates
necessary for its various components: the adder MIAD, the Digit
Product Generator DPG, the Digit Sum Encoder DSE, the selector networks
sADR, sDSE, sROB, sRIB and sTOP, and the storage buffer registers in
the DPL. The gates required for MIAD, DPG, DSE and the selectors sADR
and sDSE are dependent on the choice of the logic vector encoding for
the redundant binary digit. From the earlier discussion in Sections
4.2.2.2 through 4.2.2.6, it is clear that the sign-magnitude (SM) logic
vector encoding LVE is the simplest encoding and requires the least
number of gates for the implementation of DPG, sADR and sDSE. In the
following, we shall calculate the gate complexity of DPL assuming only
the sign-magnitude (SM) logic vector encoding.

Let

    G_DPL  = Total number of gates required for DPL, excluding
             storage registers,

    G_DPG  = Gates required for the logic implementation of the Digit
             Product Generator, DPG, using 'local generation' of the
             product transfer,

    G_MIAD = Gates required for the radix-2 adder, MIAD,

    G_DSE  = Gates required for the Digit Sum Encoder, DSE,

and let G_sDSE, G_sRIB, G_sROB, G_sADR and G_sTOP respectively denote
the gates required for the selectors sDSE, sRIB, sROB, sADR and sTOP.
From the design details described in Section 4.2, it is clear that

    G_{MIAD} = 26K^2 NANDs, assuming no encoder MATE for t^{A}_{i-1},

    G_{DPG}  = K^2 ANDs + 2 Exclusive-OR gates for sign generation
             = K^2 + 8 gates,

considering an AND gate and a NAND gate equivalent, and 1 Exclusive-OR
gate equivalent to 4 NAND gates. From Equations (4.17) through (4.22),
we have

    G_{DSE}  = 16K + c_1
    G_{sADR} = 3K^2 + 3K + 1
    G_{sDSE} = 7K
    G_{sRIB} = 4(K + 1)
    G_{sROB} = (K + 1)(K + 2)
    G_{sTOP} = 3(K + 2)

Therefore, the total number of gates required for the combinational
processing logic DPL is given by

    G_{DPL} = G_{MIAD} + G_{DPG} + G_{DSE} + G_{sADR} + G_{sDSE}
              + G_{sRIB} + G_{sROB} + G_{sTOP}
            = 26K^2 + (K^2 + 8) + (16K + c_1) + (3K^2 + 3K + 1)
              + 7K + 4(K + 1) + (K + 1)(K + 2) + 3b.

Ignoring the constant c_1 and assuming the width b of port TOP_i to be
equal to (K + 2), we have

    G_{DPL} = 31K^2 + 36K + 21                                  (4.24)
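
The closed form (4.24) can be checked numerically against the component
counts it was derived from. The sketch below does only that; the
function name and the dictionary layout are hypothetical, the constant
c_1 defaults to 0 as in the derivation, and b is taken as K + 2.

```python
def dpl_gate_counts(K, c1=0):
    """Return (per-component gate counts, total) for the DPL."""
    b = K + 2  # assumed width of port TOP, as in the derivation
    parts = {
        "MIAD": 26 * K**2,
        "DPG": K**2 + 8,
        "DSE": 16 * K + c1,
        "sADR": 3 * K**2 + 3 * K + 1,
        "sDSE": 7 * K,
        "sRIB": 4 * (K + 1),
        "sROB": (K + 1) * (K + 2),
        "sTOP": 3 * b,
    }
    return parts, sum(parts.values())


for K in range(1, 5):
    parts, total = dpl_gate_counts(K)
    # The component sum agrees with the closed form 31K^2 + 36K + 21.
    assert total == 31 * K**2 + 36 * K + 21
    major = parts["MIAD"] + parts["DPG"] + parts["DSE"]
    print(K, total, major)
```

The third printed column is the contribution of MIAD, DPG and DSE
alone, which makes it easy to see how the share of the selector
networks shrinks as K grows.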

In the expression above for G_DPL, the sum of the gates contributed
by the three major components DPG, MIAD and DSE alone is
(27K^2 + 16K + 8 + c_1) and forms the bulk of the gates required for
the implementation of DPL. The other components, such as the selector
networks, contribute a progressively smaller percentage of the gates
to the gate complexity of DPL as the value of K increases. Table 4.2
lists the gates required for the various components of the DPL as a
function of the radix r, and the last column in the table shows the
gate complexity of the combinational processing logic.

4.4.1.2 Pin complexity of DPL - Pin complexity is independent of
the logic vector encoding chosen for a redundant binary digit. The
pins required for the digit processing logic DPL are the sum of the
pins necessary for the input ports TIP_i and RIP_i and the output ports
TOP_i and ROP_i. The pins P_{RIP_i} and P_{ROP_i} required for ports
RIP_i and ROP_i, respectively, are dependent on the method of
generation of the 'product transfer' t^P_i and t^P_{i-1},
respectively. Similarly, P_{TIP_i}, the pins required for port TIP_i,
is dependent on the method of encoding used for the 'Adder Transfer'
t^A_i. The port TOP_i is shared both by the Read/Write and address
lines for PEM_i and by the 'Adder Transfer' t^A_{i-1}. P_{TOP_i}, the
pins required for TOP_i, is the larger of the number of pins required
for t^A_{i-1} and for the PEM address lines. The total number of pins
P_T necessary for the logic implementation of DPL is equal to the sum
of the pins required for the input and output ports:

    P_T = P_{TOP_i} + P_{TIP_i} + P_{RIP_i} + P_{ROP_i}


Let

    P^{A}_{t^{P}_{i-1}}  = Pins required for generating t^{P}_{i-1} using
                           the 'Adjacent Generation' method.

    P^{L}_{t^{P}_{i-1}}  = Pins required for generating t^{P}_{i-1} using
                           the 'Local Generation' method.

    P^{R}_{t^{P}_{i-1}}  = Pins required for t^{P}_{i-1} using ROM for DPG.

    P^{NE}_{t^{A}_{i-1}} = Pins required for t^{A}_{i-1} without encoder
                           MATE.

    P^{E}_{t^{A}_{i-1}}  = Pins required for t^{A}_{i-1} using encoder MATE.

    P^{R}_{t^{A}_{i-1}}  = Pins required for t^{A}_{i-1} using ROM for DPG.

If the Read Only Memory (ROM) is used for the implementation of DPG,
and $P^R_T$ is the total number of pins required for DPL, using ROM for
DPG, then

$$
\begin{aligned}
P^R_T &= \max(k+2,\ P^{R}_{t^A_{i-1}}) + P^{R}_{t^A_i} + P^{R}_{t^P_i} + P^{R}_{t^P_{i-1}} \\
      &= \max(k+2,\ 4) + 4 + (k+1) + (k+1) \\
      &= (k+2) + 4 + (k+1) + (k+1) \\
      &= 3k + 8
\end{aligned}
\tag{4.25}
$$

Another interesting configuration for the DPL implementation is
when DPG uses the 'Local Generation' method for $t^P_i$ and the encoder
MATE is used to reduce the pins required for $t^A_{i-1}$. If $P^{EL}_T$
denotes the total pins required for such a configuration, then

$$
\begin{aligned}
P^{EL}_T &= \max(k+2,\ P^{E}_{t^A_{i-1}}) + P^{E}_{t^A_i} + P^{L}_{t^P_i} + P^{L}_{t^P_{i-1}} \\
         &= \max(k+2,\ 2\lceil \log_2(k+1) \rceil) + 2\lceil \log_2(k+1) \rceil + (k+1) + (k+1) \\
         &= (k+2) + 2\lceil \log_2(k+1) \rceil + 2(k+1) \\
         &= 3k + 4 + 2\lceil \log_2(k+1) \rceil
\end{aligned}
\tag{4.26}
$$
Still another implementation configuration of interest for comparison
purposes is the one which uses no encoder for the 'Adder Transfer' and
in which the 'Adjacent Generation' method is used for the collective
product transfer $t^P_i$. Let $P^{NEA}_T$ be the total number of pins
required for such a configuration. Now

$$
\begin{aligned}
P^{NEA}_T &= \max(k+2,\ P^{NE}_{t^A_{i-1}}) + P^{NE}_{t^A_i} + P^{A}_{t^P_i} + P^{A}_{t^P_{i-1}} \\
          &= 2k + 2k + \left(\tfrac{k(k-1)}{2} + 1\right) + \left(\tfrac{k(k-1)}{2} + 1\right) \\
          &= 4k + k(k-1) + 2 \\
          &= k^2 + 3k + 2
\end{aligned}
\tag{4.27}
$$

Finally, we have a configuration which uses no encoder for $t^A_i$
and the 'Local Generation' method for $t^P_i$. The total pins $P^{NEL}_T$
for this configuration are given by the following:

$$
\begin{aligned}
P^{NEL}_T &= \max(k+2,\ P^{NE}_{t^A_{i-1}}) + P^{NE}_{t^A_i} + P^{L}_{t^P_i} + P^{L}_{t^P_{i-1}} \\
          &= 2k + 2k + (k+1) + (k+1) \\
          &= 6k + 2
\end{aligned}
\tag{4.28}
$$

Values of Equations (4.25), (4.26), (4.27) and (4.28) are tabulated in
Table  4.3  for  various  values  of  the  parameter  k.  It  shows  that  the  con- 
figuration using  ROM  for  DPG  requires  the  least  number  of  pins.  However, 

2  k+1 
the  bit  capacity  (=k2         )  of  ROM  required  for  values  of  k  >_  6  becomes 

too large and hence not suitable. However, 'Local Generation' of $t^P_i$
and use of the encoder MATE for 'Adder Transfer' gives reasonable pin count
but at the cost of introducing a new cell (full adder) for MATE in the
realization of MIAD.

Table 4.3  Pin Complexity of DPL vs. Radix for $1/2 \le \delta \le 1$

    Radix r = 2^k    k    P_T^R    P_T^EL    P_T^NEA    P_T^NEL
          4          2     14        14        12         14
          8          3     17        17        20         20
         16          4     20        22        30         26
         32          5     23        25        42         32
         64          6     26        28        56         38
        128          7     29        31        72         44
        256          8     32        36        90         50

    P_T^R   = Total pins for DPL using Read-Only-Memory DPG.
    P_T^EL  = Total pins for DPL using Encoder for Adder Transfer $t^A_i$
              and 'Local Generation' of $t^P_i$.
    P_T^NEA = Total pins for DPL without Encoder for $t^A_i$, and 'Adjacent
              Generation' of $t^P_i$.
    P_T^NEL = Total pins for DPL without Encoder for $t^A_i$ and 'Local
              Generation' of $t^P_i$.
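
The four pin-count expressions can be checked numerically. The following is a minimal Python sketch, not part of the original design; the function names (`pins_rom`, `pins_el`, `pins_nea`, `pins_nel`, `clog2`) are illustrative assumptions. Each function evaluates the first line of the corresponding equation after substituting the per-port pin counts, so the `max` term is kept explicit rather than simplified.

```python
def clog2(x: int) -> int:
    """Ceiling of log2(x) for a positive integer x."""
    return (x - 1).bit_length()


def pins_rom(k: int) -> int:
    """Eq. (4.25): ROM used for DPG."""
    return max(k + 2, 4) + 4 + (k + 1) + (k + 1)


def pins_el(k: int) -> int:
    """Eq. (4.26): encoder MATE for the adder transfer, 'Local Generation'."""
    adder_transfer = 2 * clog2(k + 1)
    return max(k + 2, adder_transfer) + adder_transfer + (k + 1) + (k + 1)


def pins_nea(k: int) -> int:
    """Eq. (4.27): no encoder, 'Adjacent Generation'."""
    product_transfer = k * (k - 1) // 2 + 1
    return max(k + 2, 2 * k) + 2 * k + product_transfer + product_transfer


def pins_nel(k: int) -> int:
    """Eq. (4.28): no encoder, 'Local Generation'."""
    return max(k + 2, 2 * k) + 2 * k + (k + 1) + (k + 1)


def table_4_3(ks=range(2, 9)):
    """Rows (radix, k, P_T^R, P_T^EL, P_T^NEA, P_T^NEL) for each k."""
    return [(2 ** k, k, pins_rom(k), pins_el(k), pins_nea(k), pins_nel(k))
            for k in ks]


if __name__ == "__main__":
    for row in table_4_3():
        print(*row, sep="\t")
```

For k from 2 to 8 the `max` term always resolves as assumed in the derivations (k+2 in the ROM and encoder cases, 2k in the two cases without an encoder), which is why the closed forms 3k+8, 3k+4+2⌈log2(k+1)⌉, k²+3k+2 and 6k+2 hold over the tabulated range.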

4.4.1.3 Effect of multiplier digit's redundancy on gate and pin
complexity of DPL - In the discussion so far, we have assumed that both
the multiplier and multiplicand digits are maximally redundant, that is,
both can assume values equal to or less than (r-1). However, if the
multiplier digit has a redundancy ratio $\delta$ lying between 1/2 and 2/3,
that is, the maximum magnitude of the multiplier digit is
$\le \lceil \tfrac{2}{3}(r-1) \rceil$, then the multiplier digit can be
recoded in Non-Adjacent Format (NAF) [43,49]. In a NAF recoded radix-r
multiplier digit, no two adjacent redundant binary digits are nonzero.
That is, the recoded multiplier digit $x_i^*$ is of the form

$$x_i^* = \sum_{j=0}^{k-1} x^*_{i_j}\, 2^j, \qquad x^*_{i_j} \in \{\bar{1}, 0, 1\}$$

such that

$$|x^*_{i_j}| \cdot |x^*_{i_{j+1}}| \ne 1$$

where $|x^*_{i_j}|$ is the absolute value of the redundant binary digit
$x^*_{i_j}$.


With the multiplier digit in NAF format, the number of inputs to
all MIRBAs of the radix-2 adder can be reduced to
$(\lceil k/2 \rceil + 1)$ from (k+1), by combining two adjacent redundant
binary digits of a column into one redundant binary digit. This is shown
in Figure 4.29. The reduction in the number of inputs to the MIRBAs of the
radix-2 adder causes a corresponding decrease in the gate and pin
complexity of the Digit Processing Logic.
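
The link between the magnitude bound and the NAF length can be illustrated with a short Python sketch. It uses the standard textbook NAF recoding rule, which is an assumption here and not necessarily the recoder used in this design; the names `max_digit` and `naf` are hypothetical. The sketch shows that every digit value of magnitude at most ⌈2/3 (r-1)⌉ fits in k redundant binary positions with no two adjacent nonzero positions.

```python
def max_digit(k: int) -> int:
    """ceil(2/3 * (r - 1)) for r = 2**k, computed in integer arithmetic."""
    r = 2 ** k
    return (2 * (r - 1) + 2) // 3


def naf(x: int) -> list:
    """Non-adjacent form of x, least significant position first.

    Each entry is -1, 0 or +1, and no two neighbouring entries are
    both nonzero.
    """
    digits = []
    while x != 0:
        if x % 2:
            # +1 if x is 1 modulo 4, -1 if x is 3 modulo 4; either choice
            # leaves a multiple of 4, so the next position is zero.
            d = 2 - (x % 4)
            x -= d
        else:
            d = 0
        digits.append(d)
        x //= 2
    return digits


if __name__ == "__main__":
    k = 3                      # radix 8
    for x in range(-max_digit(k), max_digit(k) + 1):
        print(x, naf(x))
```

Because at most every second position is nonzero, at most ⌈k/2⌉ of the k positions of a recoded digit carry a nonzero value, which is the property exploited when adjacent positions are combined.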
Gate Complexity

Assuming the sign-magnitude, logic vector encoding LVE_ for a redundant
binary digit, and 'Local Generation' of the product transfer $t^P_i$,
the gates $G'_{DPG}$ for the logic of the Digit Product Generator are
given by

$G'_{DPG}$ = gates required for the magnitude bits of the 'product'
            array $w_i$ and 'transfer' array $t^P_i$ + gates required for
            the generation of sign bits of 'product' and 'transfer'
            arrays + gates required to combine adjacent bits and
            their corresponding signs (one of the bits is zero)
            to form a single (composite) redundant binary digit
            shown by circles in Figure

          = $k^2 + 8k$ + (total # of composite redundant binary digits)
            $\times$ (gates required to form one composite redundant
            binary digit)

          = $k^2 + 8k + k \lceil k/2 \rceil \times 4$   (4.29)


The above expression shows that the gates required for the Digit Product
Generator, DPG, are increased for $1/2 \le \delta \le 2/3$ compared to the
maximal redundancy case by an amount equal to
$4k \lceil k/2 \rceil + 8(k-1)$.


Further, the complexity of the selector network sADR is also increased
because the composite redundant binary digit has to be individually
routed to the input of the MIRBAs through the selector sADR.

# of gates needed for the magnitude bits selection = $3k \lceil k/2 \rceil$

# of gates required for sign bits selection = $(2k+1) \lceil k/2 \rceil$

total # of gates required for sADR network = $(2k+1) \lceil k/2 \rceil + 3k \lceil k/2 \rceil$

$G'_{sADR} = (5k+1) \lceil k/2 \rceil$   (4.30)


From the above it is clear that although the number of gates
required for the sign bits' selection is increased compared to the case
when $\delta = 1$, the number of gates required for the magnitude bits'
selection is almost halved and the overall gates $G'_{sADR}$ required for
the selector network sADR are decreased compared to the maximal redundancy
case.

However, there is a drastic reduction in the number of gates
required for the adder MIAD because of the decrease in the number of
inputs to each MIRBA. The gates $G'_{MIAD}$ required are

$G'_{MIAD} = 26k \lceil k/2 \rceil$   (4.31)


There is no change, due to the change in redundancy, in the gates required
for either the Digit Sum Encoder DSE or the other remaining selector
networks sDSE, sRIB and sTOP. Therefore, the total number of gates
$G'_{DPL}$ required for the Digit Processing Logic, when the multiplier
digit redundancy is restricted to $1/2 \le \delta \le 2/3$ only, is

$$
\begin{aligned}
G'_{DPL} &= G'_{DPG} + G'_{sADR} + G'_{MIAD} + G_{DSE} + G_{sDSE} + G_{sRIB} + G_{sROB} + G_{sTOP} \\
         &= k^2 + 8k + 4k\lceil k/2 \rceil + (5k+1)\lceil k/2 \rceil + 26k\lceil k/2 \rceil + 16k + 7k \\
         &\quad + 4(k+1) + (k+1)(\lceil k/2 \rceil + 2) + 3(k+2) \\
         &= k^2 + (36k + 2)\lceil k/2 \rceil + 40k + 12
\end{aligned}
\tag{4.32}
$$

The gates for sROB are calculated on the assumption that we reduce
the number of registers in the register file from (k+1) to
$(\lceil k/2 \rceil + 2)$.


The values of $G'_{DPL}$ and its various components are given in Table 4.4
for different values of the parameter k. A comparison of Table 4.2 and
Table 4.4 shows that the reduction in the gates required for digit
processing logic, for $1/2 \le \delta \le 2/3$, comes mainly from the
drastic reduction in the number of gates necessary for the adder MIAD.
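
As a cross-check of Equation (4.32) against the tabulated values, the component gate counts can be evaluated directly. This is a minimal Python sketch under the assumption that the per-component expressions are exactly the terms appearing in the expansion of (4.32); the function and dictionary key names are illustrative only.

```python
def half_up(k: int) -> int:
    """ceil(k / 2) in integer arithmetic."""
    return (k + 1) // 2


def dpl_gates_restricted(k: int) -> dict:
    """Gate counts of the DPL components for 1/2 <= delta <= 2/3."""
    c = half_up(k)
    parts = {
        "DPG": k * k + 8 * k + 4 * k * c,   # Eq. (4.29)
        "MIAD": 26 * k * c,                 # Eq. (4.31)
        "sADR": (5 * k + 1) * c,            # Eq. (4.30)
        "DSE": 16 * k,
        "sDSE": 7 * k,
        "sRIB": 4 * (k + 1),
        "sROB": (k + 1) * (c + 2),
        "sTOP": 3 * (k + 2),
    }
    # Total over the eight components, Eq. (4.32).
    parts["DPL"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for k in range(2, 9):
        g = dpl_gates_restricted(k)
        print(2 ** k, k, *g.values(), sep="\t")
```

Summing the components reproduces the closed form k² + (36k+2)⌈k/2⌉ + 40k + 12, and the MIAD term accounts for more than half of the total for k of 3 or more.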


Pin Complexity

Using the same notation as in the case of $1/2 \le \delta \le 1$, we have

$P'^{NEL}_T$ = Total number of pins required for implementation
              of DPL, using 'Local Generation' method of $t^P_i$ and
              no encoder for $t^A_i$, for $1/2 \le \delta \le 2/3$

$$
\begin{aligned}
P'^{NEL}_T &= \max(k+2,\ 2\lceil k/2 \rceil) + 2\lceil k/2 \rceil + (k+1) + (k+1) \\
           &= (k+2) + 2\lceil k/2 \rceil + 2(k+1) \\
           &= 3k + 4 + 2\lceil k/2 \rceil
\end{aligned}
\tag{4.33}
$$

Similarly

$$
\begin{aligned}
P'^{EL}_T &= \max(k+2,\ 2\lceil \log_2(\lceil k/2 \rceil + 1) \rceil) + 2\lceil \log_2(\lceil k/2 \rceil + 1) \rceil + (k+1) + (k+1) \\
          &= 3k + 4 + 2\lceil \log_2(\lceil k/2 \rceil + 1) \rceil
\end{aligned}
\tag{4.34}
$$

† Strictly speaking, $\delta = \lceil \tfrac{2}{3}(r-1) \rceil / (r-1)$,
which may be slightly larger than 2/3 for certain values of r. In this
thesis, however, we shall say that $\delta \le 2/3$.

Table 4.4  Gate Complexity of DPL vs. Radix for $1/2 \le \delta \le 2/3$
           and Encoding LVE„ for a Redundant Binary Digit

    Radix r = 2^k   k   G'_DPG  G'_MIAD  G'_sADR  G_DSE  G_sDSE  G_sRIB  G_sROB  G_sTOP  G'_DPL
          4         2     28      52       11      32     14      12       9      12     170
          8         3     57     156       32      48     21      16      16      15     361
         16         4     80     208       42      64     28      20      20      18     480
         32         5    125     390       78      80     35      24      30      21     783
         64         6    156     468       93      96     42      28      35      24     942
        128         7    217     728      144     112     49      32      48      27    1357
        256         8    256     832      164     128     56      36      54      30    1556

Table 4.5  Pin Complexity of DPL vs. Radix for $1/2 \le \delta \le 2/3$

    Radix r = 2^k   k   P'_T^EL   P'_T^NEL
          4         2     12        12
          8         3     17        17
         16         4     20        20
         32         5     23        25
         64         6     26        28
        128         7     31        33
        256         8     34        36


Values of both the Equations (4.33) and (4.34) are tabulated in Table
4.5.

A comparison of Tables 4.3 and 4.5 shows that by restricting the
redundancy ratio to $1/2 \le \delta \le 2/3$ for each multiplier digit, one
can achieve almost the same number of total pins for DPL as are achieved
by using $\delta = 1$ and encoder MATE, without having to introduce the new
cell for MATE. The introduction of MATE destroys the uniformity of
the structure of MIAD.
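
The comparison between the two tables can be reproduced with a small Python sketch. The function names are illustrative assumptions; the expressions are those of Equations (4.26), (4.33) and (4.34), with the `max` term kept explicit.

```python
def clog2(x: int) -> int:
    """Ceiling of log2(x) for a positive integer x."""
    return (x - 1).bit_length()


def pins_el_full(k: int) -> int:
    """Eq. (4.26): delta = 1, encoder MATE, 'Local Generation'."""
    a = 2 * clog2(k + 1)
    return max(k + 2, a) + a + 2 * (k + 1)


def pins_nel_restricted(k: int) -> int:
    """Eq. (4.33): 1/2 <= delta <= 2/3, no encoder, 'Local Generation'."""
    a = 2 * ((k + 1) // 2)
    return max(k + 2, a) + a + 2 * (k + 1)


def pins_el_restricted(k: int) -> int:
    """Eq. (4.34): 1/2 <= delta <= 2/3, encoder, 'Local Generation'."""
    a = 2 * clog2((k + 1) // 2 + 1)
    return max(k + 2, a) + a + 2 * (k + 1)


if __name__ == "__main__":
    print("k\tEL(d=1)\tNEL'\tEL'")
    for k in range(2, 9):
        print(k, pins_el_full(k), pins_nel_restricted(k),
              pins_el_restricted(k), sep="\t")
```

Over the tabulated range the restricted-redundancy count without an encoder differs from the maximal-redundancy count with MATE by at most two pins, which is the sense in which the two are "almost the same".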

4.4.2 Logic complexity of PE control - The major components of PE
control logic are the microinstruction register, MIR, the selector
network sMIR, the Zero and Sign Detection Logic ZSD, the microinstruction
decoder and the control and timing signal generator, TCS. Of these, the
gate complexity of only the selector sMIR and ZSD is dependent on the
bit width (= k+1) of the PE module because each is one digit wide. The
gate complexity of TCS is independent of bit width, if we exclude the
file register address decoders from consideration. However, the gate
complexity is dependent on the method of implementing the control sequence
charts described earlier. The author used the control point technique
used in ILLIAC III [45] for the implementation of control sequence charts,
in order to calculate the gate complexity of PE control.
4.4.2.1 Gate complexity of PE control - Table 4.6 shows the gates
required for each subcontrol of TCS in terms of the number of control
points, gates for the control points and the gates required for the
conditional generation of control and timing signals.

The last column of Table 4.6 shows the total number of gates
required for each subcontrol. Let $G_{TCS}$ denote the total number of
gates required for the Timing and Control Signal Generator, TCS.

$G_{TCS}$ = 200 NAND gates

In addition, let $G_{DCD}$, $G_{sMIR}$ and $G_{ZSD}$ denote the gates
required for the logic implementation of the microinstruction decoder,
selector sMIR<4:0> and the Zero and Sign Detector. These gates are given by

$G_{DCD}$ = 32 NAND gates
$G_{sMIR}$ = 15 NAND gates
$G_{ZSD}$ = 6 NAND gates

Therefore

$G_{PCL}$ = Total # of gates required by PE control
            logic excluding storage elements

          = $G_{TCS} + G_{DCD} + G_{sMIR} + G_{ZSD}$

          = 253   (4.35)

4.4.2.2 Pin complexity of PE control - The total number of pins
required for the logic implementation of PE local control is the sum of
the pins required for the microinstruction ports $MIP_i$ and $MOP_i$ and
the pins required for the request-response signals of TCS. Denoting by
$P_{PCL}$ the total pins required by PE local control, we have

$$
\begin{aligned}
P_{PCL} &= P_{MIP_i} + P_{MOP_i} + P_{TCS} \\
        &= 11 + 11 + 14 \\
        &= 36
\end{aligned}
\tag{4.36}
$$

If the multiplier digit has redundancy $1/2 \le \delta \le 2/3$, then the
number of internal registers in the PE reduces to $\lceil k/2 \rceil + 1$
from (k+1). The number of address bits required to specify the internal
register correspondingly reduces to
$\lceil \log_2(\lceil k/2 \rceil + 1) \rceil$. This results in the saving
of one pin in each of the microinstruction ports and thus the pins required
for PE control logic reduce to 34.

4.4.3 Overall logic complexity of a PE - The total number of gates,
$G_{PE}$, required for the implementation of a PE is the sum of the gates
required for the combinational logic of DPL, the gates required for the PE
control logic and the gates required for the implementation of storage
registers in the PE. The gates required for DPL and the storage registers
are a function of the parameter k (radix $2^k$) which represents the
bit width of a PE. The gates required for PE control logic are virtually
independent of k and are about 250 NANDs. The storage registers in a PE
comprise the registers in the register file, buffer registers in DPL and
the register MIR in PE control logic. Considering that all the storage
registers are made of edge triggered D-type flip-flops, the gates
$G_{STO}$ required for the storage registers are given by
$G_{STO}$ = (# of registers in the register file $\times$ (k+1) +
            width of IBR + width of APR + width of GIR + width
            of MIR) $\times$ gates required for one D-type edge
            triggered flip-flop.

A D-type edge triggered flip-flop requires 6 NAND gates [46]. Therefore,
for multiplier digit redundancy ratio $1/2 \le \delta \le 1$, we have

$$
\begin{aligned}
G_{STO} &= 6 \times \left[ (k+1)(k+1) + (k+1) + (k+1) + 2\lceil \log_2(k+1) \rceil + 11 \right] \\
        &= 6 \left[ k^2 + 4k + 2\lceil \log_2(k+1) \rceil + 14 \right]
\end{aligned}
\tag{4.37}
$$


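For multiplier digit redundancy ratio $1/2 \le \delta \le 2/3$,

$$
\begin{aligned}
G'_{STO} &= 6 \times \left[ (\lceil k/2 \rceil + 1)(k+1) + (k+1) + (k+1) + 2\lceil k/2 \rceil + 10 \right] \\
         &= 6 \left[ 3(k+1) + \lceil k/2 \rceil (2 + k + 1) + 10 \right] \\
         &= 6 \left[ 3k + (k+3)\lceil k/2 \rceil + 13 \right]
\end{aligned}
\tag{4.38}
$$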
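
The algebra in Equations (4.37) and (4.38) is easy to verify numerically. The sketch below is illustrative only: the function names are assumptions, and the bit counts inside each function are the unsimplified first lines of the two equations, with 6 NAND gates per flip-flop bit.

```python
FLIP_FLOP_GATES = 6  # NAND gates per edge triggered D-type flip-flop


def clog2(x: int) -> int:
    """Ceiling of log2(x) for a positive integer x."""
    return (x - 1).bit_length()


def storage_gates_full(k: int) -> int:
    """Eq. (4.37): storage gates for maximal multiplier redundancy."""
    bits = (k + 1) * (k + 1) + (k + 1) + (k + 1) + 2 * clog2(k + 1) + 11
    return FLIP_FLOP_GATES * bits


def storage_gates_restricted(k: int) -> int:
    """Eq. (4.38): storage gates for 1/2 <= delta <= 2/3."""
    c = (k + 1) // 2  # ceil(k / 2)
    bits = (c + 1) * (k + 1) + (k + 1) + (k + 1) + 2 * c + 10
    return FLIP_FLOP_GATES * bits


if __name__ == "__main__":
    for k in range(2, 9):
        print(k, storage_gates_full(k), storage_gates_restricted(k), sep="\t")
```

The saving grows with k because the register file term drops from (k+1)² bits to (⌈k/2⌉+1)(k+1) bits.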


Now for $1/2 \le \delta \le 1$

$$G_{PE} = G_{DPL} + G_{PCL} + G_{STO}$$

Substituting the values from Equations (4.24), (4.35) and (4.37)

$$G_{PE} = 37k^2 + 60k + 12\lceil \log_2(k+1) \rceil + 359 \tag{4.39}$$


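Similarly for $1/2 \le \delta \le 2/3$

$$
\begin{aligned}
G'_{PE} &= G'_{DPL} + G'_{PCL} + G'_{STO} \\
        &= k^2 + k\left(50 + 42\lceil k/2 \rceil\right) + 20\lceil k/2 \rceil + 351
\end{aligned}
\tag{4.40}
$$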
The total number of pins required for the logic implementation of
the PE is the sum of the pins necessary for the DPL and the pins required
for the PE control logic. From the earlier discussion, we find that by
restricting multiplier (quotient) digit redundancy to
$1/2 \le \delta \le 2/3$, a reduction results both in gate complexity and
the total number of pins required for the PE. Moreover, the reduction in
pins is achieved without destroying the cellularity and structural
uniformity of the multi-input adder, MIAD. The total number of pins,
$P'_{PE}$, for the PE is given by, for $1/2 \le \delta \le 2/3$

$$
\begin{aligned}
P'_{PE} &= (3k + 4 + 2\lceil k/2 \rceil) + 34 \\
        &= 3k + 38 + 2\lceil k/2 \rceil
\end{aligned}
\tag{4.41}
$$
5.   INTERACTION  WITH  MEMORY

5.1  Introduction 

The  arithmetic  structure  consisting  of  Mantissa  Processing  Logic 
(MPL)  and  Exponent  Processing  Logic  (EPL)  needs  to  communicate  with 
Data  Main  Memory  (DMM)  for  fetching  the  operands  and  for  storing  the 
results.   The  major  considerations  in  the  design  of  the  interface 
between  the  MPL  and  the  DMM  are: 

a)  a  PE  in  the  MPL  should  require  minimum  possible  number  of  pins 
for  the  addresses  of  the  operands,  and 

b)  the  interface  and  the  DMM  should  have  high  data  bandwidth  to 
satisfy  the  data  needs  of  concurrently  processing  PEs. 

The  first  point  suggests  that  the  DMM  address  of  the  operand  should 
not  be  carried  as  part  of  the  microinstruction  issued  by  the  mantissa 
control  unit,  MCU.   Instead  we  should,  along  with  the  microinstruction, 
send  in  the  modifier  field  either  a  pointer  to  an  address  register  in 
the  interface  [30]  or  the  address  of  some  location  in  a  small  size 

buffer  memory  (in  the  interface)  which  is  preloaded  with  the  operands. 
In  this  thesis,  the  latter  approach  is  adopted. 

Since  the  different  PEs  in  the  mantissa  processing  logic,  MPL,  are 
concurrently  active  and  in  general  operating  on  digits  of  different 
operands,  the  different  PEs  may  be  accessing  the  interface  simultane- 
ously.  This  requires  high  bandwidth  and  leads  to  a  distributed,  multi- 
port  architecture  for  the  buffer  memory.   In  Section  5.2,  the  details 
of  the  architecture  of  the  mantissa  buffer  memory  called  the  Local 
Operand  Mantissa  Memory  (LOMM)  are  described. 
Because  the  operand  address  in  the  LOMM  instead  of  the  DMM  address
of  the  operand  is  carried  with  the  microinstruction  issued  by  MCU,  a 
mapping  mechanism  is  necessary  in  the  interface.   Besides,  the  LOMM  is 
of  finite  size  and  its  contents  need  to  be  stored  away  in  the  DMM  to 
make  space  for  new  operands.   A  brief  description  of  the  mapping  mech- 
anism, the  loading  (storing)  of  the  operands  (results)  from  (to)  DMM  into 
(from)  LOMM  is  given  in  Section  5.3.   Finally,  Section  5.4  describes 
some  of  the  factors  which  determine  the  word  capacity  of  the  buffer 
memory. 

The  Local  Operand  Memory  for  floating  point  operands  consists  of 
two  parts:   the  Local  Operand  Mantissa  Memory,  LOMM  and  the  Local  Operand 
Exponent  Memory,  LOEM.   Figure  5.1  shows  the  block  diagram  of  Local 
Operand  Memory.   The  architecture  of  LOMM  and  LOEM  are  independent  of 
each  other  and  depend  respectively  on  the  organization  and  architecture 
of  the  Mantissa  Processing  Logic  and  the  Exponent  Processing  Logic.   In 
this  thesis,  we  shall  be  concerned  only  with  the  architecture  and  organi- 
zation of  LOMM  and  refer  to  it  as  the  buffer  memory.   Whenever  we  speak 
of  the  buffer  memory  operand,  the  exponent  part  of  the  operand  is  under- 
stood to  be  in  the  LOEM.   The  LOEM  can  be  organized  in  many  ways.   Since 
the  number  of  bits  required  for  the  operand  exponent  is  not  too  large, 
LOEM  could  consist  of  simply  one  memory  module  of  word  size  equal  to  the 
bit  width  of  the  exponent  and  word  capacity  equal  to  that  of  a  PEM 
module. 
5.2  Organization  of  Local  Operand  Mantissa  Memory,  LOMM

The  buffer  memory,  LOMM  (Figure  5.1)  consists  of  as  many  memory 
modules,  called  PEMs  as  there  are  PEs  in  the  Mantissa  Processing  Logic. 
One  PEM  is  associated  with  one  individual  PE  and  has  the  same  bit  width 
(length  of  an  individual  PEM  word)  as  the  bit  width  of  the  PE.   Each  PEM 
communicates  with  its  own  individual  PE  and  with  the  LOMM  data  register, 
LDR  which  acts  as  an  interface  buffer  between  the  data  main  memory,  DMM 
and the buffer memory LOMM.  An operand is stored across all the PEMs
at the same location, each PEM holding one digit of the operand.  Each
PEM  can  be  considered  as  a  set  of  digit  wide  general  registers  for  its 
corresponding  PE.   The  loading  of  data  from  the  DMM  into  PEMs  and  storing 
of  data  from  PEMs  into  DMM  is  under  the  control  of  the  Local  Operand 
Memory  Control,  LOMCO  and  takes  place  via  buffer  data  register  LDR. 

Let us define a Buffer Memory Word as a word formed by concatenating
the digits stored in the same location of all the PEMs of the buffer
memory. If the DMM word length is different from the Buffer Memory Word
length, the logic for assembly and disassembly of DMM words and Buffer
Memory Words also forms part of the buffer memory logic and is under the
control of LOMCO. Further, the operand in the interface buffer register
LDR is in signed-digit format (each digit carrying the same sign as the
sign of the operand) while the DMM word is in conventional sign-magnitude
format. The format conversion logic necessary for transformation from
sign-magnitude to signed-digit form and vice versa also exists in the data
path between DMM and the register LDR. Each PEM module of the buffer
memory has its own independent access control and read/write logic. Each
PEM is accessed from two sources: its associated PE, which fetches/stores
data from/into the PEM during the execution of microinstructions LPM/SPM;
and LOMCO, which accesses the individual PEM to read/write a new buffer
memory word from/to PEM modules into/from the buffer memory data register,
LDR. Since the various PEs access their own PEMs not in synchronism but
rather independently, LOMCO reads/writes a buffer memory word into/from
LDR by requesting access to the PEMs individually. Thus, the access
between DMM and buffer memory is in parallel whereas the individual PEs
access the different digits of the buffer memory word in serial mode.
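
The relation between a sign-magnitude DMM word, the signed-digit word in LDR and the per-PEM digits can be sketched as follows. This is a simplified Python illustration under stated assumptions: the DMM word is modelled as a (sign, magnitude) pair, each PEM as a dictionary from address to digit, and the helper names are hypothetical; the actual conversion and assembly logic is not specified at this level in the text.

```python
def to_signed_digits(sign: int, magnitude: int, k: int, n: int) -> list:
    """Split a sign-magnitude value into n radix-2**k digits.

    Digits are returned most significant first, and every digit carries
    the sign of the operand (sign is +1 or -1).
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    r = 1 << k
    if not 0 <= magnitude < r ** n:
        raise ValueError("magnitude does not fit in n digits")
    digits = []
    for _ in range(n):
        digits.append(sign * (magnitude % r))
        magnitude //= r
    digits.reverse()
    return digits


def from_signed_digits(digits: list, k: int) -> tuple:
    """Recover (sign, magnitude) from radix-2**k signed digits."""
    r = 1 << k
    value = 0
    for d in digits:
        value = value * r + d
    return (-1 if value < 0 else 1), abs(value)


def store_word(pems: list, address: int, digits: list) -> None:
    """Write one buffer memory word: digit i goes to PEM i at `address`."""
    for pem, d in zip(pems, digits):
        pem[address] = d


def load_word(pems: list, address: int) -> list:
    """Read one buffer memory word by collecting the digit of each PEM."""
    return [pem[address] for pem in pems]


if __name__ == "__main__":
    pems = [dict() for _ in range(4)]          # four PEMs, radix 4 (k = 2)
    store_word(pems, 5, to_signed_digits(-1, 45, k=2, n=4))
    print(load_word(pems, 5), from_signed_digits(load_word(pems, 5), k=2))
```

In this model the transfer between LDR and the PEMs touches all n memories at one address, while each PE only ever reads or writes its own dictionary.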

The  distributed  multiport  architecture  for  LOMM  has  the  advantages 
of  modularity,  easy  expandability,  high  data  bandwidth  and  small  number 
of  pins  per  PE  for  fetching  and  storing  of  operands.   Each  PEM  module 
in  LOMM  is  a  random  access  monolithic  memory.   Semiconductor  memory 
chips  of  N  bit  capacity  are  organized  as  N  x  1  words  because  it  leads 
to  minimum  number  of  leads  and  also  this  organization  makes  feasible 
very  effective  use  of  error  correcting  codes  and  software  diagnostic 
routines  to  detect  and  overcome  component  failures.  Each  PEM  module 
can  be  assembled  from  these  chips — as  many  as  the  bit  width  of  the  PE. 
Thus  having  a  separate  PEM  module  associated  with  each  PE  provides  the 
necessary  flexibility  and  modularity.   Secondly,  the  small  word  size  of 
a  PEM  module  permits  faster  access  and  cycle  times.   Multiport  architec- 
ture permits  concurrent  operation  of  the  various  PEM  modules  and  thus  high 
data  rate  and  data  bandwidth.   Multiport  organization  and  the  communica- 
tion of  each  PE  with  only  its  own  PEM  module  allows  each  PEM  module 
to be of small word capacity and thus reduces the number of pins required
for addressing a location in PEM.  Each PEM module can be logically
considered as an extension of the register file of each PE.

5.3  A  Description  of  Buffer  Memory  Control 

The  operation  of  the  buffer  memory  is  under  the  control  of  LOMCO. 
The  function  of  LOMCO  is  two  fold: 

a)  When  LOMCO  is  presented  with  a  DMM  address  of  an  operand  by  the 
GACU,  LOMCO  replies  back  with  the  PEM  address  of  the  operand,  if  the 
operand  is  available  in  the  buffer  memory.  However,  if  the  operand  is 
not  available  in  the  buffer  memory,  LOMCO  searches  for  an  empty  location 
in  the  buffer  memory,  fetches  the  operand  from  the  DMM,  loads  the  operand 
in  the  empty  location  of  buffer  memory  and  replies  back  with  the  address 
of  the  location  just  loaded. 

b)  Another  function  of  LOMCO  is  to  store  away  the  contents  of 
those  locations  in  buffer  memory  which  are  no  longer  being  used  by  any 
of  the  PEs  of  the  Mantissa  Processing  Logic  to  make  space  for  new 
operands  that  may  be  needed  by  other  arithmetic  machine  instructions. 

The LOMCO achieves the above two functions by a table look-up
mechanism. There are as many entries in the control table as there are
locations in the buffer memory, one entry for each location. Each entry
in the control table has five fields as shown in Figure 5.2. The fields
labeled DMMA and PEMA respectively denote the DMM operand address and
the corresponding PEM address of the operand. The field labeled VCB is
a one bit field and denotes the validity control bit for the data stored
in the location specified by PEMA. Whenever an operand is loaded in
the buffer memory from the Data Main Memory, DMM, the validity control
bit is set to logical '1' state. When a DMM operand address is presented
to the LOMCO, an associative table look-up is performed. If there
is a match and the field VCB is '1', then the corresponding PEMA entry
(i.e., the address of the location in PEM containing the DMM variable
(operand)) is sent back as a response. If, however, there is no match
or if the validity bit is zero but match takes place,† the LOMCO searches
for an empty entry in the table. An empty entry in the table indicates
an empty location in the buffer memory. On finding an empty location,
LOMCO fetches the operand specified by the DMM operand address, stores it
in the empty buffer memory location, sets the VCB field to logical state
'1' and responds back with PEMA, that is, the address of the operand
location in PEM. Another field associated with each entry is the Usage
Count or USC field. This field is essential for deciding when to replace
the contents of the corresponding buffer memory location. Its function and
necessity can be explained in the following way.

Figure 5.2   Structure of Control Table in LOMCO.

In  order  that  all  the  PEs  may  be  kept  usefully  busy  in  processing 
the  microinstructions,  it  is  important  that  there  should  be  a  steady 
stream  of  microinstructions  being  issued  to  PEs  and  also  that  the  cor- 
responding operands  be  available  in  the  buffer  memory  for  processing  by 
the  PEs.  This  implies  the  use  of  an  instruction  look-ahead  unit  which 
supplies  arithmetic  instructions  to  the  GACU  of  the  arithmetic  unit. 


† Such a case can occur when the value of the DMM variable is changed
by some other unit accessing the DMM without a corresponding update in
the buffer memory.

The LOMCO in cooperation with GACU loads the buffer memory with operands
in  advance  of  their  processing  by  the  PEs.   Since  the  PEs  in  the  Mantissa 
Processing  Logic  operate  on  the  individual  digits  of  an  operand  in 
sequence  and  different  PEs  may  be  operating  on  different  operands  or  on 
different  digits  of  the  same  operand,  an  operand  in  the  buffer  memory 
cannot  be  replaced  by  a  new  operand  as  long  as  a  PE  in  MPL  is  using  any 
digit  of  the  operand.  Moreover,  if  in  the  arithmetic  instruction  look- 
ahead  buffer  in  GACU,  there  exists  an  arithmetic  instruction  which  may 
make  use  of  the  operand  in  buffer  memory,  the  operand  should  not  be 
replaced  by  a  new  operand  in  order  to  avoid  an  unnecessary  DMM  access 
by  LOMCO.  All  these  control  functions  are  provided  through  the  Usage 
Count field, USC, in each entry of the table. The content of the field USC
is a tally which is incremented by one every time a match is obtained with
the DMMA field in the table entry or an operand is fetched from DMM into
buffer memory. This tally in USC is decreased by unity every time a PEM
accessing microinstruction exits from the last PE in the MPL. The PEM
accessing microinstructions are the LPM and SPM microinstructions as
discussed in Section 2.6. When the tally in the usage count field USC
goes to zero, it implies that no PE is using the operand in the
corresponding buffer memory location and it can be replaced by a new
operand from DMM, if necessary.

The final field associated with each entry in the control table is
the Store Flag field, STF. It is a one-bit field and is set to logical
state '1' every time the PEM reference microinstruction SPM exits from
the last PE in the MPL. Whenever the tally in the USC field goes to zero
and the STF field is '1', LOMCO stores the contents of all the PEMs
at the corresponding location into DMM at the location specified by the
DMMA field. Note that only the 'final' value of an operand (at the end
of a series of calculations involving that operand) is stored in DMM,
even though there may be several store microinstructions, SPMs, in the
stream of microinstructions flowing through the PEs. This is because,
unless the tally in USC is zero, there is no guarantee that no PE is
still modifying the contents of its PEM location when an SPM
microinstruction referring to the same buffer memory location exits from
the last PE in the MPL.
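
The bookkeeping performed through the USC and STF fields can be
sketched in a few lines of Python. This is only an illustrative model
under stated assumptions: the class and method names are hypothetical,
DMM and the PEMs are reduced to dictionaries holding one value per
address, and clearing STF after the write-back is our assumption rather
than something the text specifies.

```python
from dataclasses import dataclass


@dataclass
class ControlEntry:
    """One control-table entry; field names follow the text, layout is assumed."""
    dmma: int          # DMM address of the operand held at this buffer location
    usc: int = 0       # Usage Count tally
    stf: bool = False  # Store Flag


class BufferControl:
    """Toy model of the USC/STF bookkeeping done by the buffer memory control."""

    def __init__(self):
        self.table = {}  # buffer address -> ControlEntry
        self.dmm = {}    # stand-in for DMM: address -> value
        self.pem = {}    # stand-in for the PEMs: buffer address -> value

    def reference(self, buf_addr, dmma):
        """A fetch from DMM or a match on DMMA: the tally goes up by one."""
        if buf_addr not in self.table:
            self.table[buf_addr] = ControlEntry(dmma)
            self.pem[buf_addr] = self.dmm.get(dmma, 0)
        entry = self.table[buf_addr]
        assert entry.dmma == dmma, "location holds a different operand"
        entry.usc += 1

    def exits_last_pe(self, buf_addr, is_store):
        """An LPM (is_store=False) or SPM (is_store=True) left the last PE.

        Returns True when the operand was written back to DMM.
        """
        entry = self.table[buf_addr]
        if is_store:
            entry.stf = True
        entry.usc -= 1
        if entry.usc == 0 and entry.stf:
            self.dmm[entry.dmma] = self.pem[buf_addr]
            entry.stf = False  # assumption: the flag is cleared after write-back
            return True
        return False
```

In this model an SPM that exits while the tally is still non-zero only
sets the flag; the write-back to DMM happens when the last outstanding
reference has left the MPL.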

5.4  Size  of  Buffer  Memory 

The  number  of  words  in  the  buffer  memory  and  hence  in  each  PEM 
module  is  dependent,  among  others,  on  two  main  factors. 

a)  The  maximum  possible  number  of  PEs  which  are  concurrently 
accessing  different  operands  in  the  buffer  memory  at  any  time. 

b)  The  ratio  of  the  rate  at  which  the  buffer  memory  can  be  loaded 
from  DMM  and  the  rate  of  processing  of  a  microinstruction  in  a  PE. 

In order that no PE may be idle due to lack of operands in the
buffer memory, it is important that the word capacity of the buffer
memory be at least equal to the maximum number of PEs which may at any
time be accessing different locations of the buffer memory. The number
of PEs concurrently accessing different buffer memory locations in turn
depends on the nature of the arithmetic instruction stream, the amount
of arithmetic instruction look-ahead in the GACU and the number of PEs
in the Mantissa Processing Logic. If there are no data dependencies in
the instruction stream and a constant stream of microinstructions can be
issued to the PEs, as many as

$$\frac{p \times n}{p+1} \simeq n$$

PEs may be accessing different operands at any time, where n is the
number of PEs in the Mantissa Processing Logic and p is the number of
operands that can be summed by the Multi Sum microinstruction, MS. Such
a case can occur in the evaluation of an arithmetic expression of the
form

$$A = \sum_{i=1}^{m} B_i$$

where $B_1, B_2, \ldots, B_m$ are DMM operands.

However,  in  practical  cases  there  are  always  data  dependencies  in 
the  instruction  stream  and  if  n  is  large,  there  would  be  some  idle  PEs 
and  the  size  of  the  buffer  memory  required  would  always  be  less  than  n. 
From the empirical program studies made by Kuck et al. [47], Knuth [48]
and Foster and Riseman [49], we deduce that a buffer memory of word
capacity  16  or  at  the  most  32  would  be  sufficient  for  all  practical 
purposes. 

The  word  capacity  of  the  buffer  memory  would  also  depend  on  the 
number  of  pins  available  in  a  PE  for  addressing  its  PEM  module  in  the 
buffer  memory. 

In our design of the microinstructions LPM and SPM, we have assumed
that there are $2^{k+1}$ words in the buffer memory because (k+1) bits
are available in the microinstruction word for addressing the PEM
module. In case the word capacity of PEM has to be large, e.g., when k
is small, then correspondingly more bits would have to be assigned in
the microinstruction word for the PEM address. For $k \ge 3$, k+1 bits
are sufficient.
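
The two sizing relations of this section can be checked numerically
with the short sketch below. The function names are ours and purely
illustrative; the figure of 16 words used in the check is the estimate
quoted above from the empirical studies.

```python
def max_concurrent_operands(p: int, n: int) -> float:
    """Upper figure p*n/(p+1) for PEs accessing different operands."""
    return p * n / (p + 1)


def buffer_words(k: int) -> int:
    """Words addressable with the (k+1)-bit PEM address field."""
    return 2 ** (k + 1)


def address_bits_suffice(k: int, required_words: int) -> bool:
    """True when k+1 address bits cover the required buffer capacity."""
    return buffer_words(k) >= required_words
```

For example, `address_bits_suffice(3, 16)` holds because a 4-bit
address reaches 16 words, whereas for k = 2 the 3-bit address reaches
only 8 words and extra address bits would be needed.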

6.   IMPLEMENTATION  OF  MACHINE  ARITHMETIC  INSTRUCTIONS

6.1  Introduction 

In  this  chapter,  we  describe  how  a  'machine'  arithmetic  instruction 
can  be  implemented  using  the  various  microinstructions  and  the  particular 
arithmetic unit organization developed in earlier chapters. The
arithmetic unit is organized to process floating point operands. The
fractional parts of the operands are processed by the PEs in the Mantissa
Processing  Logic  and  the  exponent  arithmetic  is  performed  by  the  Exponent 
Processing  Logic.   Since  the  exponent  arithmetic  involves  simply  taking 
the  sum  or  difference  of  the  exponents  of  two  operands,  no  details  are 
given  of  the  method  of  processing  the  exponents.   Instead,  the  major 
emphasis  is  put  on  the  sequence  of  microinstructions  which  are  issued 
by  the  Mantissa  Control  Unit,  MCU,  to  process  the  'machine'  arithmetic 
instruction. 

6.2  Implementation  of  'Machine'  Arithmetic  Instructions 

6.2.1  Global description of the processing of a 'machine' arithmetic
instruction - When the instruction look-ahead unit in the machine,
of which this Arithmetic Unit forms a part, detects an arithmetic
instruction, it sends the instruction to the Arithmetic Unit for
processing. The GACU part of the local arithmetic control acts as
the interface between the arithmetic unit and the instruction look-ahead
unit. On receiving the machine instruction, the GACU calls upon the
buffer memory control, LOMCO, to provide the buffer memory address of
the data operand referred to in the machine arithmetic instruction.
(If the data operand is not present in the buffer memory, LOMCO will
fetch the operand and then provide the buffer memory address as
explained in Chapter 5.) The GACU now sends the machine instruction,
with the DMM operand address replaced by the corresponding buffer
memory address, to the MCU. The MCU decodes the arithmetic instruction
and calls upon the exponent control unit, if necessary, to calculate the
sum or difference of the exponents of the operands involved. The
microinstructions for the processing of mantissas are then issued either
concurrently with exponent processing or on response from the exponent
control unit, depending on the arithmetic instruction. (In the case of an
Add or Subtract instruction, the difference of the exponents must be
available for operand alignment before mantissa processing can begin.)
The sequence of microinstructions issued by the MCU for the mantissa
processing depends on the OP-code (for example, Add, Subtract, Multiply,
etc.) of the arithmetic instruction and on whether the instruction is of
the single address, double address, or three address type. In the
following discussion, a single address type of 'machine' arithmetic
instruction is assumed. That is, each instruction is of the format
OP-code, EA where the OP-code field carries the mnemonic and EA is the
effective DMM address of the operand. The other operand, if necessary,
is implicitly assumed to be in the Accumulator, which is distributed in
the PEs of the Mantissa Processing Logic. The register INR1 of the
register file acts as the Accumulator register in each PE.

The MCU also monitors the development of any exceptional conditions,
like exponent overflow/underflow, during arithmetic instruction
processing and reports such an occurrence to the GACU. The GACU in turn
may pass this status information to other parts of the machine, e.g.,
the instruction look-ahead and fetch unit, for branching and other
decisions. Due to the most-significant-digit-first nature of arithmetic
processing, the occurrence of singular conditions can be detected by the
MCU before the execution of the arithmetic instruction is fully complete.

6.2.2  Floating point Addition - Let the Floating Point ADD instruction
be given by FPA, EA where EA is the effective DMM address of the
Addend operand. Let ea be the corresponding buffer memory address of
the Addend. Once the buffer memory address ea is known, the MCU calls
upon the Exponent Control Unit, ECU, to calculate the difference of the
exponents of the operands in the Accumulator and at ea. It then issues
the microinstructions for the right shift of the appropriate operand to
align the operands and for the summation of the operands, and finally
checks for mantissa overflow and takes the necessary corrective action.
This can be followed by normalization, assimilation to conventional form
and then storing of the result operand.

6.2.2.1  Mantissa processing microprogram - Assuming, without loss
of generality, that the Addend is to be shifted for operand alignment,
the sequence of microinstructions issued by the MCU is shown below.

     Microinstruction     comments

     LPM  2  ea      ;    INR2 ← PEM[ea]
     RS   2  0       ;    As many right shift
     RS   2  0       ;    microinstructions are issued
      .              ;    as is the difference in
      .              ;    exponents of the Addend and
     RS   2  0       ;    Augend.
     SS              ;    INR1 ← INR1 + INR2

The sequence of microinstructions given above loads the addend in
register INR2, aligns the operands and then sums the operands. The
sequence of microinstructions for normalization and assimilation of
operands is discussed in Sections 6.2.6 and 6.2.7 respectively.

6.2.2.2  Mantissa overflow correction - In the d-vector representation
of the sum

$$S = s_0 . s_1 s_2 s_3 \ldots s_n$$

it is possible that $|s_0| = 1$. However, this is only an indication of
potential overflow. A mantissa overflow occurs only when
$s_0 \cdot s_i > 0$, where $s_i$ is the first (most significant)
non-zero digit following $s_0$ in the d-vector representation of S. Due
to the use of an RBA-2 using the sign-magnitude logic vector encoding
LVE for the redundant binary digit in the implementation of
microinstruction SS, bogus overflow [35] would occur quite often. The
bogus overflow occurs whenever $s_0 \cdot s_i < 0$, because the sum can
then always be recoded such that $s_0 = 0$. For example, the sum
$1.0\bar{3}21$ can always be recoded into its algebraic equivalent
0.9721.
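
The distinction between true and bogus overflow can be illustrated by
the following sketch, which evaluates a signed-digit vector exactly and
applies the sign test stated above. The function names and the
list-of-integers encoding of the d-vector are our own illustrative
choices, and treating $s_0 = \pm 1$ with an all-zero fraction as a true
overflow is an assumption made for completeness.

```python
from fractions import Fraction


def sd_value(digits, radix):
    """Exact value of s_0 . s_1 s_2 ... s_n; digits[0] is s_0."""
    return sum(Fraction(d, radix ** i) for i, d in enumerate(digits))


def classify_overflow(digits):
    """Return 'none', 'true' or 'bogus' using the sign test on s_0 and s_i."""
    s0 = digits[0]
    if s0 == 0:
        return "none"
    for d in digits[1:]:
        if d != 0:
            # s_i is the first non-zero digit after s_0
            return "true" if s0 * d > 0 else "bogus"
    # Assumption: |S| = 1 exactly is not a proper fraction, so it overflows
    return "true"
```

With the decimal example of the text, `classify_overflow([1, 0, -3, 2, 1])`
reports a bogus overflow, and `sd_value([1, 0, -3, 2, 1], 10)` equals
`sd_value([0, 9, 7, 2, 1], 10)`.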

The mantissa overflow can be corrected by shifting the sum right by
one digital position and correspondingly adjusting the exponent of the
sum. In the case of bogus overflow, however, this procedure would
unnecessarily cause a loss of one significant digit (k bits for
radix-$2^k$ arithmetic). This loss can be avoided during normalization
of the sum if the shifted-out digit of the sum is saved in the End Unit
and reintroduced during left-shifting of the operand. The left shift of
the sum after normalization recoding is done to eliminate the leading
zeros in the recoded sum.

6.2.3  Floating point Subtraction - The processing for Floating
Point Subtraction is identical to that for Floating Point Addition
except that in the Mantissa Processing Microprogram, the
microinstruction SS is preceded by the microinstruction TI 2,2. This
microinstruction reverses the sign of each digit of the subtrahend.
The sequence of microinstructions for mantissa processing is as follows.

     Microinstructions    comments

     LPM  2  ea      ;    INR2 ← PEM[ea]
     RS   2  0       ;
     RS   2  0       ;
      .              ;    operand alignment
      .              ;
     RS   2  0       ;
     TI   2  2       ;    INR2 ← (-1) . INR2
     SS              ;    INR1 ← INR1 + INR2

6.2.4  Floating point Multiplication - Let the Floating Point
Multiply instruction be denoted as FPM, EA where EA is the effective
DMM address of the Multiplier operand. The operand in the accumulator
is the implicitly assumed multiplicand operand. If ea is the
corresponding buffer memory address of the multiplier operand, the
processing for the Floating Point Multiply instruction involves the
following steps by the MCU. The MCU calls upon the Exponent Control Unit
to sum the exponents of the operands in the accumulator and at the LOEM
address specified by ea; concurrently, the MCU issues the sequence of
microinstructions to the PEs to form the double length product in the
PEs, and finally checks for any exceptional conditions. Mantissa
overflow cannot occur because the mantissas of both the operands are
less than unity. However, exponent overflow may take place.

6.2.4.1  Microprogram for mantissa processing - The mantissa
processing for the multiplication instruction involves the generation of
partial products and the final product digits. For processing, the
multiplier and multiplicand operands are respectively in file registers
INR3 and INR2, whereas the Accumulator register INR1 is used to form and
accumulate the partial products. Unlike the conventional multipliers,
the most significant half of the final double length product is in the
Multiplier register INR3 and the least significant half is in the
Accumulator register INR1.

Because the partial products are formed beginning with the most
significant digits, the most significant digits of the product are
formed first. Also, to achieve maximum precision the partial product
is shifted left during each step instead of the multiplicand being
shifted right. Due to the left shift of the partial product, two
problems immediately arise.

a)   During the left shift of the Accumulator register (partial
product), not one but two digits (the digits to the left and right of
the radix point) are shifted out. These two digits need to be recoded
into a final product digit to be stored into the multiplier register
and a residual digit to be added to the next partial product in the next
step. Pisterzi [30] has shown that a recoder with $r^2$ states is
necessary for this purpose. Such a recoder can be conceptually looked
upon as an extension of the adder network to the left. However, in our
case, due to the existence of bogus overflow in RBA-2, the basic cell of
the adder MIRBA, the recoder's logic design would have to be different.

b)   Another problem due to left shifting of the partial product is
that the digits of the most significant half of the final product, which
become available one by one as the output of the recoder (connected to
the most significant digital position of the adder), need to be stored
in the multiplier register and/or the buffer memory. In order that
these product digits may be stored in proper order in the multiplier
register, the product digit (output of recoder) needs to be stored in
the least significant digital position of the multiplier register,
because this position is vacant due to the left shift of the multiplier
register. But the MCU communicates with and knows the state of only the
most significant PE. The solution to this problem is to send the value
of the digit to be placed in the least significant digital position with
the left shift multiplier microinstruction. As a matter of fact, the
particular definition of the left shift microinstruction was contrived
specifically to serve this purpose.

The MCU forms the partial products by issuing a set of three
microinstructions as many times as there are digits in the multiplier
operand. The three microinstructions are 'Left Shift Multiplier' to
examine the multiplier digit, 'Form Multiple and Add' to form the
partial product and 'Left Shift Accumulator' to shift the partial
product. The microprogram for the formation of the double length product
for six digit long multiplier and multiplicand operands is given below.
$m_i$ and $P_j$ (i = 1, 2, ..., 6 and j = 1, 2, ..., 11, 12)
respectively denote multiplier and Product digits.


     Microinstruction     comments

     LPM  3  ea      ;    INR3 ← PEM[ea]   (≡ Multiplier)
     TD   1  2       ;    INR2 ← INR1      (≡ Multiplicand)
     LDC  1  0       ;    INR1 ← 0
     LS   3  0       ;    MCU ← m_1, INR3 ← r.INR3
     FMA  m_1        ;    INR1 ← INR1 + m_1.INR2
     LS   1  0       ;    MCU ← shifted-out digits, INR1 ← r.INR1
     LS   3  P_1     ;    MCU ← m_2, INR3_6 ← P_1, INR3 ← r.INR3
     FMA  m_2        ;
     LS   1  0       ;
     LS   3  P_2     ;
     FMA  m_3        ;
     LS   1  0       ;    These achieve the partial products
     LS   3  P_3     ;    for the rest of the multiplier
     FMA  m_4        ;    digits m_2, m_3, ..., m_6 in the same
     LS   1  0       ;    way as the sequence of immediately
     LS   3  P_4     ;    preceding four microinstructions
     FMA  m_5        ;
     LS   1  0       ;
     LS   3  P_5     ;
     FMA  m_6        ;
     LS   1  0       ;
     LS   3  P_6     ;

A pictorial representation of the flow and execution of the above
sequence in a 6 PE Mantissa Processing Logic is shown in Figure 6.1. In
this figure,

     ${}_iP_j$ is the jth digit of the portion of the ith
     accumulated partial product which is in the
     Accumulator register, and

     $P_j$ is the jth digit of the final product and is

$$P_j = {}_jP_1, \qquad 1 \le j \le 6$$

$$P_j = {}_6P_{j-5}, \qquad 6 < j \le 11$$

$$P_{12} = 0$$

The column labeled 'Register in MCU' is a one digit wide register which
holds the multiplier digit when the multiplier register is shifted left
for examining and selecting the next multiplier digit for partial
product formation. This register is also used to hold the product digit,
from the output of the accumulator overflow recoder, for storing in the
least significant digital position of the multiplier register via the
Left Shift Multiplier (LS 3,$P_j$) microinstruction.

If the multiplier digit $m'_j$ in the microinstruction FMA,$m'_j$ is to
have a redundancy ratio $\delta \le 2/3$, then the two consecutive
digits $m_j$ and $m_{j+1}$ of the multiplier register have to be
examined to generate one modified multiplier digit $m'_j$. The algebraic
design of such a redundancy recoder is given in Appendix A-1. The
sequence of microinstructions for mantissa processing when the
multiplier digit redundancy is $\delta \le 2/3$ would remain the same as
before, except that the fourth microinstruction, LS 3,0, will be
immediately followed by another LS 3,0 microinstruction in order to
bring $m_2$ to the recoder. For the rest of the modified multiplier
digits $m'_j$ (j > 1), only digit $m_{j+1}$ needs to be brought into the
MCU because $m_j$ is already known from the previous step. Note that
there may be one more modified multiplier digit than the number of
multiplier digits in the original multiplier operand, to maintain
algebraic equivalence between the two forms of the same multiplier
operand.
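
To make the two-digit examination concrete, the sketch below uses a
generic signed-digit transfer recoding. It is not the recoder of
Appendix A-1, whose algebraic design is not reproduced here; it is only
a stand-in with the same external behaviour described above: each
modified digit is formed from $m_j$ and $m_{j+1}$, the value of the
operand is preserved, and one extra leading digit may appear. The
function names and the threshold r/2 are our assumptions. For radix 16
the modified digits produced this way have magnitude at most 9, which
lies within 2/3 of the maximum digit 15; we make no claim for other
radices.

```python
from fractions import Fraction


def recode_multiplier(digits, radix):
    """Recode fraction digits m_1..m_n (MSD first) into m'_0, m'_1..m'_n."""
    threshold = radix // 2  # assumed transfer threshold

    def transfer(d):
        # Transfer sent one position to the left by digit d
        if d >= threshold:
            return 1
        if d <= -threshold:
            return -1
        return 0

    n = len(digits)
    recoded = [transfer(digits[0]) if n else 0]  # extra leading digit m'_0
    for j in range(n):
        interim = digits[j] - radix * transfer(digits[j])        # uses m_j
        incoming = transfer(digits[j + 1]) if j + 1 < n else 0   # uses m_{j+1}
        recoded.append(interim + incoming)
    return recoded


def sd_value(digits, radix, first_power):
    """Exact value when digits[0] has weight radix**(-first_power)."""
    return sum(Fraction(d, radix ** (first_power + i)) for i, d in enumerate(digits))
```

For instance, `recode_multiplier([15, -15], 16)` gives `[1, -2, 1]`,
and both forms have the value 225/256.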

6.2.5  Floating point Division - The processing for Floating Point
Division is almost identical to that for Multiplication except that the
quotient digits must be determined by examination of the partial
remainder. Division is performed by repetitive additions and shifts. In
the Floating Point Divide instruction FPD, EA, the effective DMM address
of the divisor operand is given by EA and the Dividend is implicitly
assumed to be the operand in the Accumulator. The processing by the MCU
for Floating Point Divide involves calling upon the Exponent Control
Unit to take the difference of the exponents of the dividend
(accumulator) and the divisor at the buffer memory address ea, and
processing of the mantissa to calculate the quotient digits. The
exceptional conditions are the possible exponent underflow and a zero
value of the divisor.

6.2.5.1  Microprogram for mantissa processing - The major problems
in the implementation of Division in the arithmetic unit under
consideration are:

a)  the  storage  of  double  precision  dividend, 

b)  the  calculation  of  the  quotient  digits, 

c)  the  placement  of  quotient  digits  in  the  PEs,  and 

d)  the extension to the left of the Accumulator and the Adder
Network to take care of a shifted partial remainder.

We will not discuss the above problems in detail, but only indicate the
possible solutions. For details, the reader should refer to Pisterzi
[30]. The double precision dividend is stored in two registers: the
Accumulator register INR1 and the multiplier register INR3. The
Accumulator register INR1 holds the most significant half of the
dividend and INR3 holds the least significant half. At the end of the
processing, they respectively hold the remainder and the quotient.
Register INR2 will hold the divisor.

Because of the redundant number representation for the quotient
digit, the quotient digit can be calculated by a 'model division' [50]
which uses only truncated versions of the divisor and shifted partial
remainders. It is shown in Appendix A-2 that for radix $2^k$
($k \ge 3$), 3 digits of the divisor and 2 digits of the fractional
part, in addition to the integer part, of the shifted partial remainder
are sufficient for the calculation of the quotient digit with redundancy
ratio of 2/3 or 1. However, for radix-4, one more digit each of the
divisor and partial remainder is necessary if the quotient digit has a
redundancy ratio of 2/3. But for maximally redundant quotient digits, we
use the same number of digits as for $k \ge 3$. The examination of the
operand digits for quotient calculation in the MCU is done by shifting
the operands left as many times as the number of digits necessary.
Examination of the divisor digits needs to be done only once at the
beginning, since the same digits take part in the calculation of every
quotient digit. However, since the partial remainder changes, it has to
be shifted every time. But since the unshifted divisor and the radix-r
shifted partial remainder are necessary in the PEs for calculation of
a new partial remainder, the examination of the operand digits is done
by shifting another register which contains a copy of the operand whose
digits are to be examined. File register INR4 can be used for that
purpose.

The quotient digit is stored in the vacant least significant
digital position of the register INR3 by using the Left Shift
microinstruction, just as in the case of Multiplication. The quotient
digit is sent in the modifier field of the Left Shift microinstruction
issued by the MCU.

Because  of  the  characteristics  of  the  Division  process,  the  shifted 
partial  remainder  in  the  accumulator  would  always  be  less  than  r  in 
absolute  magnitude.  Thus  the  overflow  recoder  that  was  used  in  the 
Multiplication  process  can  be  used  to  store  the  integer  part  of  the 
shifted  partial  remainder. 

Note that the technique used for the 'Model' Division is completely
independent of the architecture of our Arithmetic Unit. It can be done
by table look-up or by any other method depending on the time and cost
considerations. Pure table look-up is too expensive for any reasonable
radix greater than 4. We propose that the quotient digit be calculated
in the MCU serially, one bit at a time, for radix $\ge 8$ and then
assembled into a radix-r digit before calculating the next partial
remainder.

The sequence of microinstructions for Floating Point Division is
very similar to that for Multiplication and hence is not given here.

6.2.6  Normalization of operands - An operand is considered normalized
if it satisfies definition 3, given in Section 3.5.1, of a
normalized number. The major steps in the normalization process are
left shifting of the signed-digit operand till there are no leading
zeros, recoding the shifted operand by the 'Normalize Recode'
microinstruction, NR, and finally left shifting the recoded operand to
remove the leading zeros, if any were created by the microinstruction NR.

Because of the interface control signal $Z_1$ between the PE and the
MCU, there is no need to launch a Left Shift microinstruction to examine
the leading digit for zero magnitude. Simply monitoring $Z_1$ is
sufficient, and this has the advantage that no overshift of the operand
would take place during the normalization process.

Note  that  since  the  microinstruction  NR  operates  only  on  the  operand 
in  the  Accumulator  register  INR1,  the  operand  should  be  placed  in  INR1 
for  the  normalization  process. 

6.2.7  Assimilation of signed-digit operand - The process of
Assimilation converts the signed-digit operand, whose different digits
carry, in general, different signs, to a form in which each digit has
the same sign. This sign is the sign of the operand. The procedure and
the sequence of microinstructions necessary for Assimilation are
identical to those for Normalization except that the microinstruction AR
instead of NR is launched for recoding the operand.
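
The arithmetic effect of Assimilation can be illustrated by the sketch
below. It is not the AR digit algorithm, which is defined with the other
microinstructions in the earlier chapters; as a stand-in it uses a
conventional borrow propagation from the least significant digit, and
the function name and list encoding are hypothetical.

```python
def assimilate(digits, radix):
    """Convert signed digits (MSD first) to (sign, non-negative digits).

    Each input digit lies in [-(radix-1), radix-1]; successive digits
    have weights decreasing by a factor of radix.
    """
    leading = next((d for d in digits if d != 0), 0)
    if leading == 0:
        return 0, [0] * len(digits)
    sign = 1 if leading > 0 else -1
    work = [sign * d for d in digits]  # leading non-zero digit is now positive
    for j in range(len(work) - 1, 0, -1):
        if work[j] < 0:
            work[j] += radix           # borrow one unit from the next higher digit
            work[j - 1] -= 1
    return sign, work
```

Applied to the decimal example of Section 6.2.2.2,
`assimilate([1, 0, -3, 2, 1], 10)` returns `(1, [0, 9, 7, 2, 1])`. The
leading zero produced here is the kind that the final left shift of the
procedure removes.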

7.   SUMMARY  AND  CONCLUSIONS

7.1  Summary  and  Discussion  of  Results

Chapter 1 described the characteristics and the constraints of the
newly emerging technology of Large Scale Integration (LSI) and its
implications for the design of a digital system. Based on this
discussion, a set of desirable characteristics for an Arithmetic Unit
was formulated, and a limited interconnection arithmetic unit as
proposed by Pisterzi [30] was chosen as a vehicle to study the
arithmetic and logic design aspects of the basic module of such an
arithmetic unit.

Chapter 2 described briefly the logical organization and mode of
operation of the arithmetic unit, especially of the Mantissa Processing
Logic (MPL). The MPL is composed of a linear cascade of identical logic
modules called Processing Elements (PEs) which execute a sequence of
microinstructions issued to the MPL by the Mantissa Control Unit (MCU).
The MCU is an interpreter for the 'machine' arithmetic instructions like
'Multiply', 'Add', etc., and issues a sequence of microinstructions for
processing. The salient feature of the processing in the MPL is that a
microinstruction issued by the MCU is not broadcast to all the PEs in
the MPL. Instead, the microinstruction is executed by the PEs in
sequence, starting with the most significant PE. (The 'significance' of
a PE is the same as the arithmetic significance of the operand digit
contained in that PE.) The method of processing in the MPL was
illustrated by an example which showed how the various microinstructions
in the microinstruction stream could be pipelined. This pipelining
feature allows the meshing in of machine arithmetic instructions even
before all the result digits of a previous machine instruction have been
calculated. The discussion in this chapter forms the framework for the
material in the subsequent chapters.

Chapter 3 is concerned with the arithmetic design of the Processing
Element. Due to the digit serial nature of the arithmetic processing
and the desirability of limited intercommunication between PEs, a
redundant number system is a necessity. The number system was chosen to
be Signed Digit and maximally redundant, firstly because the conversion
from the conventional number representation of sign and magnitude to the
maximally redundant form and vice versa is very simple, and secondly
because the radix-$2^k$ arithmetic can be realized in terms of identical
stages of redundant binary $\{\bar{1}, 0, 1\}$ arithmetic structures.
This gives the required repetitive and uniform logical structure to the
internal logic of the PE. Then a set of simple arithmetic
microinstructions sufficient for the implementation of the four basic
arithmetic operations is defined, and their digit algorithms are
described by their arithmetic transfer functions and their algebraic
implementation. The particular algebraic implementation of the digit
algorithm is influenced by the LSI technology constraints of regularity
of logic structure, simplicity of the basic cell of the logic structure
and the least number of pins for the module. The regularity of logical
structure is obtained by implementing the radix-$2^k$ multi-input adder
as a linear cascade of k multi-input redundant binary adders. Each
multi-input redundant binary adder in turn is implemented as a tree
structure of still simpler 2-input or 3-input redundant binary adders.
A definition of normalized operands was developed, and its influence on
the arithmetic properties of overall processing and the complexity of
quotient digit calculation was also discussed. The definition for
normalized operands chosen was such that the processes of
'normalization' and 'assimilation' could share the same logic.

Chapter 4, which is the major contribution of this thesis, treats
the logic design of the Processing Element. In this chapter, the gate
complexity and pin complexity of the Processing Element are shown to be
related to the bit width of the Processing Element (radix of arithmetic
processing in the MPL) and the redundancy in the multiplier/quotient
digit used to form the partial products in the process of
multiplication. The major components of the Processing Element are the
Register File for the storage of active operands, the Digit Processing
Logic, DPL, which is essentially a combinational logic network for the
data transformation, and the Processing Element Control, which receives
and decodes the microinstruction and generates the necessary sequence of
control signals to condition the combinational network DPL. The number
of gates and pins required for the DPL is very strongly dependent on the
bit width of the Processing Element, whereas the number of gates and
pins required for the PE control is almost independent of the bit width
of the module. From the inspection of Tables 4.3 and 4.5, it is clear
that 'local generation' of the collective Product Transfer $t_i^{p}$
should be used to keep down the number of pins necessary on the PE
module.

An examination of Tables 4.2 and 4.4, which give respectively the
number of gates required in the implementation of DPL for multiplier
digit redundancy ratios of 1 and 2/3, leads to the conclusion that a
redundancy ratio of 2/3 should be employed for the multiplier and quotient
digit. This would require the existence of a multiplier digit recoder in
the MCU because the digits of the multiplier operand in the MPL have a
redundancy ratio of unity. But the multiplier digit recoder is very simple.
A still further advantage of restricting the redundancy ratio of the
multiplier/quotient digit to $\le 2/3$ is that $a_{FMA}$, the number of PEs which must
cooperate with a given PE in the execution of microinstruction FMA, would
always remain 2 irrespective of the radix of the multiplier digit, when
the MIRBAs in MIAD are implemented as a log-sum tree of RBA-2s only.
This can be seen from Table 7.1.


Table 7.1  Values of a and a. when the multiplier/quotient
digit redundancy ratio is $1/2 < \delta < 2/3$

[Table layout not recoverable from the scan; the surviving entries are 5, 2, 8, 5, 6, 2.]

Finally, inspection of Table 4.5 shows that the DPL requires only
36 pins for radix-256, that is, for an 8-bit width of the PE module.
Since the PE control also requires 36 pins, an eight-bit-wide PE module
should be employed in the Mantissa Processing Logic in order to balance
the arithmetic processing cost in the DPL against the PE control cost. This
requires a total of 72 pins on the PE module package, which is by no means
unreasonable by the standards of today's technology.

A  negative  aspect,  from  the  LSI  viewpoint,  of  the  structure  of 
Mantissa Processing Logic should be noted here. Since the microinstructions
flow from one PE to the other instead of being broadcast from the
MCU, the number of pins required for the PE control is doubled in the
present structure. Moreover, the request-response strategy of PE
coordination control also doubles the number of pins required compared to
a synchronous control synchronized to a central clock. However, the
asynchronous control has the advantage that any number of PEs can be
concatenated more easily to achieve any desired precision without
worrying about clock skew problems. It should be noted, however,
that  the  arithmetic  and  logic  design  of  the  DPL  as  described  in  this 
thesis  is  independent  of  the  nature  of  PE  control  and  the  same  DPL 
design  can  be  used  to  design  a  PE  module  for  a  bus-structured  and 
synchronous  Mantissa  Processing  Logic. 

In Chapter 5, a brief description was given of the logic organization
and structure of a buffer memory which acts as an interface between
the arithmetic unit and the Data Main Memory. The major characteristic
of the buffer memory is that communication between the buffer memory and
Data Main Memory is at the word level whereas the communication between the
buffer memory and Mantissa Processing Logic is on a digit serial basis.
It  was  further  argued  that  the  size  of  the  buffer  memory  in  words  is 
fairly  small — of  the  order  of  16  to  32  words. 
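
The word-level versus digit-serial interface can be illustrated with a small Python sketch. It is not taken from the thesis: the function names `word_to_digits` and `digits_to_word`, the 32-bit word width, and the use of unsigned digits are all assumptions made for illustration (the Mantissa Processing Logic itself works with redundant signed digits). The 8-bit digit width corresponds to the radix-256 PE module discussed above.

```python
def word_to_digits(word, word_bits=32, digit_bits=8):
    """Split an unsigned word into radix-2**digit_bits digits, most significant first.

    word_bits=32 is a hypothetical word width; digit_bits=8 matches radix 256.
    """
    if word_bits % digit_bits != 0:
        raise ValueError("word width must be a multiple of the digit width")
    mask = (1 << digit_bits) - 1
    # Walk the shift amount down from the top digit to the bottom digit.
    return [(word >> shift) & mask
            for shift in range(word_bits - digit_bits, -1, -digit_bits)]


def digits_to_word(digits, digit_bits=8):
    """Reassemble a word from digits supplied most significant first."""
    word = 0
    for digit in digits:
        word = (word << digit_bits) | digit
    return word
```

In this sketch a word fetched from memory would be handed to the arithmetic unit one digit at a time, starting with the most significant digit, and a result would be collected digit by digit before being written back as a whole word.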

Chapter  6  showed  how  various  machine  arithmetic  instructions  could 
be  implemented  using  the  microinstructions. 

7.2  Suggestions for Further Work

Reliability  and  availability  considerations  were  not  addressed  in  this 
thesis.   Since  microinstructions  flow  from  any  PE  to  its  adjacent  PE,  it 
is  important  that  all  the  consecutive  PEs  operate  properly  in  order  for 
the  Arithmetic  Unit  to  operate  properly.   Determining  organizational 
modifications  in  the  interconnection  structure  of  the  PEs  which  would 
facilitate  the  automatic  reconfiguration  of  properly  operating  PEs  to 
yield  a  working  Arithmetic  Unit  with  degraded  performance,  in  the 
presence  of  faulty  PEs,  is  a  very  important  area  of  further  investigation. 

Because  the  processing  in  the  Arithmetic  Unit  takes  place  on  a 
digit-by-digit  basis  starting  with  the  most  significant  digit,  this 
Arithmetic  Unit  structure  has  a  potential  for  implementing  a  dynamically 
varying precision arithmetic. But due to the possibility of different
PEs working concurrently on digits of different operands, certain
structural modifications would be necessary. Investigation of such
modifications is another interesting area for further work. One possible solution
may be the use of some kind of 'end-of-the-word' marker as the delimiter
for the precision of the operands and the use of a bus-structure to
inform the MCU when the last digits of the operands have been operated on.

A  simulation  of  the  Arithmetic  Unit  using  data  from  real  programs 
would  be  interesting  and  useful  to  determine  the  useful  word  capacity 
of  a  PEM  module. 

Finally, the logic design of the MCU and the GACU should be
performed to determine the actual gate complexity of these modules.
APPENDIX A-1

ALGEBRAIC DESIGN OF A RIGHT-DIRECTED RECODER TO CHANGE
MULTIPLIER DIGIT'S REDUNDANCY FROM $\delta = 1$ TO $\delta \le 2/3$ †


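This recoder changes the multiplier operand

$$M = 0\,.\,m_1\, m_2\, m_3 \ldots m_{j-1}\, m_j\, m_{j+1} \ldots m_n$$

where

$$|m_i| \le (r-1) \qquad \forall\, i = 0, 1, \ldots, n$$

to an algebraically equivalent operand

$$M' = m'_0\,.\,m'_1\, m'_2\, m'_3 \ldots m'_{j-1}\, m'_j\, m'_{j+1} \ldots m'_n$$

such that

$$|m'_0| \le 1 \quad \text{and} \quad |m'_i| \le \left\lceil \tfrac{2}{3}(r-1) \right\rceil \qquad \forall\, i = 0, 1, \ldots, n.$$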

† Strictly speaking, $\delta = \lceil \tfrac{2}{3}(r-1) \rceil / (r-1)$, which may be slightly
larger than 2/3.

In order to do the above recoding serially on a digit-by-digit
basis, starting from the most significant digit, one needs to know only
the digit to the immediate left of the digit to be recoded in addition
to the digit itself. For example, if $m_j$ is the digit to be recoded,
then the algebraic design of the recoder is given by the following.

[Diagram of digit positions j-1 and j of M, labeled $m_{j-1}$, $m_j$, $w_{j-1}$, $w_j$]

Each digit $m_j$ and $m_{j-1}$ is first recoded into a pair of digits $\{t_{j-1}, w_j\}$
and $\{t_{j-2}, w_{j-1}\}$ so that

$$m_{j-1} = r\,t_{j-2} + w_{j-1}, \qquad m_j = r\,t_{j-1} + w_j, \qquad r > 4$$

and

$$t_{i-1} = \begin{cases} 1 & \text{if } m_i \ge \left\lceil \tfrac{2}{3}(r-1) \right\rceil \\ 0 & \text{otherwise} \\ \bar{1} & \text{if } m_i \le -\left\lceil \tfrac{2}{3}(r-1) \right\rceil \end{cases} \qquad i = j,\ j-1$$

$$-\left( \left\lceil \tfrac{2}{3}(r-1) \right\rceil - 1 \right) \le w_i \le \left\lceil \tfrac{2}{3}(r-1) \right\rceil - 1$$

The recoded digit $m'_{j-1}$ is given by

$$m'_{j-1} = w_{j-1} + t_{j-1}$$

The above recoder is applicable for all values of index j. Note that
$m'_0$ cannot have a magnitude greater than 1 and the recoded multiplier
operand may have one digit extra compared to the original operand.
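
The recoding rule of this appendix can be exercised numerically. The following Python sketch is an illustration under stated assumptions, not the thesis's logic design: the names `recode`, `scaled_value` and `check` are hypothetical, digits are held as plain signed integers, the rule is read as $m_i = r\,t_{i-1} + w_i$ with $m'_i = w_i + t_i$ and $m'_0 = t_0$, and the last transfer $t_n$ is assumed to be 0 because no digit follows $m_n$. It checks that the recoded operand keeps the value of the original and that the digit bounds hold for radices greater than 4.

```python
import random


def recode(digits, r):
    """Recode digits m_1..m_n (|m_i| <= r-1) into m'_0..m'_n.

    Assumes radix r > 4 and a final transfer t_n = 0.
    """
    bound = (2 * (r - 1) + 2) // 3          # ceil(2(r-1)/3)
    transfers = []                          # transfers[i-1] is t_{i-1}, produced by m_i
    interims = []                           # interims[i-1] is w_i
    for m in digits:
        t = 1 if m >= bound else (-1 if m <= -bound else 0)
        transfers.append(t)
        interims.append(m - r * t)          # from m_i = r*t_{i-1} + w_i
    transfers.append(0)                     # assumed t_n = 0
    # m'_0 = t_0, and m'_i = w_i + t_i for i = 1..n
    return [transfers[0]] + [w + t for w, t in zip(interims, transfers[1:])]


def scaled_value(digits, r):
    """Value of d_0 . d_1 ... d_n multiplied by r**n, as an exact integer."""
    total = 0
    for d in digits:
        total = total * r + d
    return total


def check(r, n=12, trials=1000, seed=0):
    """Randomly test value preservation and digit bounds for radix r."""
    rng = random.Random(seed)
    bound = (2 * (r - 1) + 2) // 3
    for _ in range(trials):
        m = [rng.randint(-(r - 1), r - 1) for _ in range(n)]
        m_prime = recode(m, r)
        assert scaled_value([0] + m, r) == scaled_value(m_prime, r)
        assert abs(m_prime[0]) <= 1
        assert all(abs(d) <= bound for d in m_prime[1:])
    return True
```

For example, with r = 8 the operand digits 5, -7, 3 recode to 1 . -4, 1, 3: both sides equal 267 when scaled by $8^3$, and no fractional digit exceeds $\lceil 14/3 \rceil = 5$ in magnitude.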




APPENDIX  A-2 
PRECISION  REQUIREMENTS  FOR  QUOTIENT  DIGIT  CALCULATION 

According to the analysis by Atkins [50] based on P-D plot considerations,
the worst case precision of the operands required for
quotient digit calculation is given by the relation

$$|\Delta P| \le D_{\min}\left(\delta - \tfrac{1}{2}\right) - \frac{|\Delta d|}{2}\,(n - 1 + \delta) \qquad \text{(A2.1)}$$

where

$\Delta P$ = truncation error in the left shifted (by one digital position) partial remainder

$\Delta d$ = truncation error in the divisor

$n$ = maximum allowed value of the quotient digit

$\delta$ = redundancy ratio of the quotient digit $= \dfrac{n}{r-1}$

and

$D_{\min}$ = minimum value of the truncated divisor.

Let $|\Delta d| = r^{-\ell}$

and $|\Delta P| = r^{-(\ell-1)}$

where $\ell$ = number of digits in the truncated divisor.

Case 1: $\delta = 1$

For a maximally redundant quotient digit,

$$\delta = 1, \qquad n = (r-1)$$

and for a maximally redundant divisor normalized according to Definition 3
given in Section 3.5,

$$D_{\min} = \frac{1}{r} - \frac{1}{r^2} + \frac{1}{r^{\ell}}$$

Substituting the above values in Equation (A2.1), we have

$$r^{-(\ell-1)} \le \left(\frac{1}{r} - \frac{1}{r^2} + r^{-\ell}\right)\left(1 - \tfrac{1}{2}\right) - \frac{r^{-\ell}}{2}\,\bigl((r-1) - 1 + 1\bigr)$$

which simplifies to

$$\frac{1}{r^{\ell}} \le \frac{(r-1)}{r^2(3r-2)} \qquad \text{(A2.2)}$$

Values of $\ell$ which satisfy the relation (A2.2) for different values of
r are tabulated in Table A.1.
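
The simplification leading to (A2.2) can be spelled out step by step. The block below is a reconstruction of the algebra rather than text from the thesis; the shorthand $x = r^{-\ell}$ is introduced here only for brevity, and the starting line assumes $|\Delta P| = r\,x$, $|\Delta d| = x$, $\delta = 1$, $n = r-1$ and $D_{\min} = (r-1)/r^2 + x$.

```latex
\begin{align*}
r\,x &\le \tfrac{1}{2}\left(\frac{r-1}{r^2} + x\right) - \tfrac{1}{2}\,(r-1)\,x \\
2r\,x + (r-1)\,x - x &\le \frac{r-1}{r^2} \\
(3r-2)\,x &\le \frac{r-1}{r^2}
\quad\Longrightarrow\quad
x = \frac{1}{r^{\ell}} \le \frac{r-1}{r^2(3r-2)}
\end{align*}
```

As a check, for r = 4 the bound is 3/160, which $4^{-3} = 1/64$ satisfies and $4^{-2} = 1/16$ does not, so three digits are needed.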

Table A.1
Values of $\ell$ vs. Radix and Redundancy Ratio of a Quotient Digit

Radix      Redundancy ratio, $\delta$, of a quotient digit
  r        $\delta = 1$        $\delta \le 2/3$
  2             4                    -
  4             3                    4
  8             3                    3
 16             3                    3
 32             3                    3
 64             3                    3

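Case 2: $\delta = 2/3$

In this case,

$$n = \left\lceil \tfrac{2}{3}(r-1) \right\rceil, \qquad D_{\min} = \frac{r-1}{r^2} + \frac{1}{r^{\ell}}$$

so that Equation (A2.1) becomes

$$r^{-(\ell-1)} \le \left(\frac{r-1}{r^2} + \frac{1}{r^{\ell}}\right)\left(\frac{n}{r-1} - \frac{1}{2}\right) - \frac{r^{-\ell}}{2}\left(n - 1 + \frac{n}{r-1}\right)$$

which simplifies to

$$\frac{1}{r^{\ell}} \le \frac{2n - (r-1)}{r^2} \cdot \frac{r-1}{2r(r-1) + n(r-2)} \qquad \text{(A2.3)}$$

Values of $\ell$ which satisfy the relation (A2.3) for different values of
r are given in Table A.1.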
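
The intermediate steps behind (A2.3) are sketched below. This is a reconstruction, not text from the thesis; it again uses the shorthand $x = r^{-\ell}$ (an assumption of this note) and starts from (A2.1) with $|\Delta P| = r\,x$, $|\Delta d| = x$ and $D_{\min} = (r-1)/r^2 + x$.

```latex
\begin{align*}
r\,x &\le \left(\frac{r-1}{r^2} + x\right)\!\left(\delta - \tfrac{1}{2}\right)
        - \frac{x}{2}\,(n - 1 + \delta),
  \qquad \delta = \frac{n}{r-1} \\
x\,(2r + n - \delta) &\le \frac{r-1}{r^2}\,(2\delta - 1) \\
x\,\frac{2r(r-1) + n(r-2)}{r-1} &\le \frac{2n - (r-1)}{r^2} \\
x = \frac{1}{r^{\ell}} &\le \frac{2n-(r-1)}{r^2}\cdot\frac{r-1}{2r(r-1)+n(r-2)}
\end{align*}
```

For instance, with r = 8 and n = 5 the right-hand side is 21/9088, which is larger than $8^{-3} = 1/512$ but smaller than $8^{-2} = 1/64$, so three divisor digits are required.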

Table A.1 clearly shows that, for r ≥ 8, 3 digits of the divisor and 2 most
significant digits of the fractional part of the shifted partial remainder
in addition to its (shifted partial remainder's) integer part are
sufficient to calculate the quotient digit. It can be further shown that not all
the bits of the last digits of the truncated divisor and partial remainder
are necessary for the quotient digit calculation.
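
The entries of Table A.1 can be recomputed directly from relations (A2.2) and (A2.3). The Python sketch below is an illustration, not part of the thesis: the function names are hypothetical, the right-hand sides are taken as $(r-1)/(r^2(3r-2))$ and $\frac{2n-(r-1)}{r^2}\cdot\frac{r-1}{2r(r-1)+n(r-2)}$ with $n = \lceil \tfrac{2}{3}(r-1) \rceil$, and exact rational arithmetic is used so that the boundary case r = 2 is not affected by rounding.

```python
from fractions import Fraction


def bound_case1(r):
    """Right-hand side of (A2.2), for redundancy ratio 1."""
    return Fraction(r - 1, r * r * (3 * r - 2))


def bound_case2(r):
    """Right-hand side of (A2.3), with n = ceil(2(r-1)/3)."""
    n = (2 * (r - 1) + 2) // 3
    return (Fraction(2 * n - (r - 1), r * r)
            * Fraction(r - 1, 2 * r * (r - 1) + n * (r - 2)))


def smallest_digit_count(r, bound):
    """Smallest integer count such that r**(-count) <= bound."""
    count = 1
    while Fraction(1, r ** count) > bound:
        count += 1
    return count


def table_a1(radices=(2, 4, 8, 16, 32, 64)):
    """Return rows (r, digits for ratio 1, digits for ratio <= 2/3)."""
    rows = []
    for r in radices:
        case1 = smallest_digit_count(r, bound_case1(r))
        # For r = 2 the only nonzero digit magnitude is 1, so the ratio is 1
        # and the second column has no entry (shown as '-' in Table A.1).
        case2 = smallest_digit_count(r, bound_case2(r)) if r > 2 else None
        rows.append((r, case1, case2))
    return rows
```

Under these assumptions `table_a1()` gives 4 digits for r = 2, 3 and 4 digits for r = 4, and 3 digits in both columns for r = 8 through 64, in agreement with Table A.1.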